Problem Statement¶

Business Context¶

Business communities in the United States are facing high demand for human resources, but one of the constant challenges is identifying and attracting the right talent, which is perhaps the most important element in remaining competitive. Companies in the United States look for hard-working, talented, and qualified individuals both locally as well as abroad.

The Immigration and Nationality Act (INA) of the US permits foreign workers to come to the United States to work on either a temporary or permanent basis. The act also protects US workers against adverse impacts on their wages or working conditions by ensuring US employers' compliance with statutory requirements when they hire foreign workers to fill workforce shortages. The immigration programs are administered by the Office of Foreign Labor Certification (OFLC).

OFLC processes job certification applications for employers seeking to bring foreign workers into the United States and grants certifications in those cases where employers can demonstrate that there are not sufficient US workers available to perform the work at wages that meet or exceed the wage paid for the occupation in the area of intended employment.

Objective¶

In FY 2016, the OFLC processed 775,979 employer applications for 1,699,957 positions for temporary and permanent labor certifications. This was a nine percent increase in the overall number of processed applications from the previous year. The process of reviewing every case is becoming a tedious task as the number of applicants is increasing every year.

The increasing number of applicants every year calls for a Machine Learning based solution that can help in shortlisting the candidates having higher chances of VISA approval. OFLC has hired the firm EasyVisa for data-driven solutions. You as a data scientist at EasyVisa have to analyze the data provided and, with the help of a classification model:

  • Facilitate the process of visa approvals.
  • Recommend a suitable profile for the applicants for whom the visa should be certified or denied based on the drivers that significantly influence the case status.

Data Description¶

The data contains the different attributes of employee and the employer. The detailed data dictionary is given below.

  • case_id: ID of each visa application
  • continent: Information of continent the employee
  • education_of_employee: Information of education of the employee
  • has_job_experience: Does the employee has any job experience? Y= Yes; N = No
  • requires_job_training: Does the employee require any job training? Y = Yes; N = No
  • no_of_employees: Number of employees in the employer's company
  • yr_of_estab: Year in which the employer's company was established
  • region_of_employment: Information of foreign worker's intended region of employment in the US.
  • prevailing_wage: Average wage paid to similarly employed workers in a specific occupation in the area of intended employment. The purpose of the prevailing wage is to ensure that the foreign worker is not underpaid compared to other workers offering the same or similar service in the same area of employment.
  • unit_of_wage: Unit of prevailing wage. Values include Hourly, Weekly, Monthly, and Yearly.
  • full_time_position: Is the position of work full-time? Y = Full Time Position; N = Part Time Position
  • case_status: Flag indicating if the Visa was certified or denied

Installing and Importing the necessary libraries¶

In [1]:
# Check Python version
import sys
print(sys.version)
3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0]
In [2]:
# Upgrade build tools first
%pip install -q --upgrade pip setuptools wheel

# Py3.12-compatible pins
%pip install -q \
  numpy==2.0.2 \
  pandas==2.2.2 \
  matplotlib==3.8.4 \
  seaborn==0.13.2 \
  scikit-learn==1.6.1 \
  sklearn-pandas==2.2.0 \
  xgboost==2.0.3
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the below.

In [3]:
# Core data science stack
try:
    import os
    import sys
    import pandas as pd
    import numpy as np, pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
except ImportError as e:
    print(f"[ERROR] Core package missing: {e.name}.")
    print("Solution: Install with pip, e.g.: !pip install pandas numpy matplotlib seaborn")

# Utilities
try:
    from pathlib import Path
   # from google.colab import drive
    from datetime import datetime
    from collections import Counter
except ImportError as e:
    print(f"[ERROR] Utility package missing: {e.name}.")
    print("Solution: If running outside Colab, remove or replace 'google.colab' imports.")

# Version verification
try:
    import pandas, numpy, matplotlib, seaborn, sklearn, xgboost
    print("\n[OK] Installed package versions:")
    print(f" * Python:     {sys.version.split()[0]}")
    print(f" * pandas:     {pandas.__version__}")
    print(f" * numpy:      {numpy.__version__}")
    print(f" * matplotlib: {matplotlib.__version__}")
    print(f" * seaborn:    {seaborn.__version__}")
    print(f" * scikit-learn: {sklearn.__version__}")
    print(f" * xgboost:    {xgboost.__version__}")
except Exception as e:
    print("[ERROR] Could not verify all package versions:", e)
    print("Double-check that packages are installed with pip/conda.")
[OK] Installed package versions:
 * Python:     3.10.12
 * pandas:     2.2.2
 * numpy:      2.0.2
 * matplotlib: 3.8.4
 * seaborn:    0.13.2
 * scikit-learn: 1.6.1
 * xgboost:    2.0.3

Import Dataset¶

In [4]:
# Primary WSL path Windows C: mounted at /mnt/c
p1 = Path("/mnt/c/Users/caste/OneDrive/Desktop/MLE/AI ML/EasyVisa.csv")
# Fallback in case
p2 = Path("/mnt/c/Users/caste/Desktop/MLE/AI ML/EasyVisa.csv")

data_path = p1 if p1.exists() else (p2 if p2.exists() else None)

print("[INFO] Candidates:")
print("  -", p1)
print("  -", p2)

if data_path is None:
    raise FileNotFoundError(
        "[ERROR] Could not find EasyVisa.csv at either location above.\n"
        "Checks:\n"
        "  • Confirm the file is not cloud-only in OneDrive.\n"
        "  • Verify the path inside WSL\n"
        "  • If stored elsewhere, update the path here."
    )

print(f"[OK] Using: {data_path}")

# Loader with encoding fallback
def read_csv_safely(path: Path, **kwargs):
    try:
        return pd.read_csv(path, **kwargs)
    except UnicodeDecodeError:
        print(f"[WARN] UnicodeDecodeError for {path.name}. Retrying with latin-1 …")
        return pd.read_csv(path, encoding="latin-1", engine="python", **kwargs)
    except pd.errors.ParserError as e:
        print(f"[WARN] ParserError for {path.name}: {e}\nTrying engine='python' with on_bad_lines='skip'.")
        return pd.read_csv(path, engine="python", on_bad_lines="skip", **kwargs)

# Load
df = read_csv_safely(data_path)
print(f"[OK] Loaded {data_path.name}")
[INFO] Candidates:
  - /mnt/c/Users/caste/OneDrive/Desktop/MLE/AI ML/EasyVisa.csv
  - /mnt/c/Users/caste/Desktop/MLE/AI ML/EasyVisa.csv
[OK] Using: /mnt/c/Users/caste/OneDrive/Desktop/MLE/AI ML/EasyVisa.csv
[OK] Loaded EasyVisa.csv

Overview of the Dataset¶

View the first and last 5 rows of the dataset¶

In [5]:
# Dataset first 5 rows
display(df.head()
        .style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
        )
  case_id continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_estab region_of_employment prevailing_wage unit_of_wage full_time_position case_status
0 EZYV21644 Europe Bachelor's N Y 3126 1800 South 192806.060000 Year Y Certified
1 EZYV24364 Asia Master's Y N 1649 1800 Midwest 148321.570000 Year Y Certified
2 EZYV22098 Asia Bachelor's Y N 2043 1800 Northeast 145985.430000 Year Y Certified
3 EZYV12103 Asia High School N Y 808 1800 West 127303.960000 Year Y Denied
4 EZYV13156 Asia Master's Y N 573 1800 Northeast 124457.500000 Year Y Certified
In [6]:
# Dataset last 5 rows
display(df.tail()
        .style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
        )
  case_id continent education_of_employee has_job_experience requires_job_training no_of_employees yr_of_estab region_of_employment prevailing_wage unit_of_wage full_time_position case_status
25475 EZYV5163 Asia Bachelor's Y N 500 2016 Northeast 12476.400000 Year Y Certified
25476 EZYV15379 Europe Doctorate N N 41 2016 Northeast 9442.230000 Year Y Certified
25477 EZYV15437 Europe Bachelor's Y N 1338 2016 West 639.957100 Hour Y Denied
25478 EZYV20413 Asia High School N N 1655 2016 Northeast 538.996600 Hour Y Denied
25479 EZYV8640 Europe Master's N N 2089 2016 West 399.629700 Hour Y Denied

Understand the shape of the dataset¶

In [7]:
# Print the rows and coulmns
print(f"\n[INFO] Dataset shape: {df.shape}")
[INFO] Dataset shape: (25480, 12)

Check the data types of the columns for the dataset¶

In [8]:
# Dataset info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25480 entries, 0 to 25479
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   case_id                25480 non-null  object 
 1   continent              25480 non-null  object 
 2   education_of_employee  25480 non-null  object 
 3   has_job_experience     25480 non-null  object 
 4   requires_job_training  25480 non-null  object 
 5   no_of_employees        25480 non-null  int64  
 6   yr_of_estab            25480 non-null  int64  
 7   region_of_employment   25480 non-null  object 
 8   prevailing_wage        25480 non-null  float64
 9   unit_of_wage           25480 non-null  object 
 10  full_time_position     25480 non-null  object 
 11  case_status            25480 non-null  object 
dtypes: float64(1), int64(2), object(9)
memory usage: 2.3+ MB

Exploratory Data Analysis (EDA)¶

Learned to set the data up before hand to avoid pitfalls later down the pipeline

  • Rock-solid config & verification (RNG, target integrity, reproducibility)

  • Canonical maps for messy strings (target, Y/N flags, categories)

  • Schema guard prints unknown headers, missing canon columns, and collisions

  • Single build_canonical_df(df) entrypoint returning a clean, model-ready frame

  • Stratified split sanity and class-balance report

In [9]:
# A) CONFIG & VERIFICATION (Pro)
import re, difflib
from typing import Dict, Any

# --- Config ---
RNG = 72
np.random.seed(RNG)

TARGET_RAW = "case_status"     # raw target column in EasyVisa
TARGET_LBL = "_case_label"     # canonical string label: 'Certified' / 'Denied'
TARGET_BIN = "y"               # numeric target: 1/0

# Minimal raw columns expected before any feature engineering
EXPECTED_COLS = {
    "case_id","continent","education_of_employee","has_job_experience",
    "requires_job_training","no_of_employees","yr_of_estab",
    "region_of_employment","prevailing_wage","unit_of_wage",
    "full_time_position","case_status"
}

# --- Utilities ---
def verify_rng(seed: int = RNG) -> None:
    """Deterministically verify RNG reproducibility."""
    a = np.random.RandomState(seed).rand(5)
    b = np.random.RandomState(seed).rand(5)
    assert np.allclose(a, b), "RNG reproducibility failed"
    print("[OK] RNG reproducibility")

def verify_target_exists(df: pd.DataFrame) -> None:
    """Ensure the raw target column is present."""
    assert TARGET_RAW in df.columns, f"Missing target column `{TARGET_RAW}`"
    print("[OK] Target column present")

def report_class_balance(y: pd.Series, title: str = "[INFO] class balance") -> Dict[str, Any]:
    """Print and return class balance + imbalance ratio (if binary)."""
    vc = y.value_counts(dropna=False)
    ratio = float(vc.max()/vc.min()) if (len(vc) == 2 and vc.min() > 0) else float("nan")
    print(f"{title} →")
    print(vc.to_string())
    if np.isfinite(ratio):
        print(f"  Imbalance ratio ≈ {ratio:.2f}:1")
    return {"counts": vc.to_dict(), "ratio": ratio}

def verify_expected_columns(df: pd.DataFrame) -> Dict[str, Any]:
    """Check presence of all expected raw columns; print concise status."""
    missing = sorted([c for c in EXPECTED_COLS if c not in df.columns])
    extra   = sorted([c for c in df.columns if c not in EXPECTED_COLS])
    if missing:
        print(f"[WARN] Missing expected columns: {missing}")
    else:
        print("[OK] All expected columns present")
    if extra:
        print(f"[INFO] Extra columns (not required): {extra[:10]}{' …' if len(extra) > 10 else ''}")
    return {"missing": missing, "extra": extra, "passed": len(missing) == 0}

def verify_no_nans(series: pd.Series, name: str) -> None:
    """Hard guard to ensure no NaNs in a required series."""
    n = int(series.isna().sum())
    print(f"[VERIFY] {name}: NaN count = {n}")
    assert n == 0, f"Found NaNs in `{name}`"

# --- One-call dataset verifier ---
def verify_dataset(df: pd.DataFrame, name: str = "EasyVisa", strict: bool = False) -> Dict[str, Any]:
    """
    End-to-end verification for the raw EasyVisa frame.
    - RNG reproducibility
    - Target column exists
    - Expected raw columns present
    - (Optional) class balance if a canonical/encoded target is available
    Returns a dict report; raises if strict=True and checks fail.
    """
    print(f"\n=== VERIFICATION START: {name} ===")
    verify_rng()
    verify_target_exists(df)
    schema_rep = verify_expected_columns(df)

    # If the notebook has already created canonical target columns, report them
    report_lbl = {}
    report_bin = {}
    if TARGET_LBL in df.columns:
        report_lbl = report_class_balance(df[TARGET_LBL], "[INFO] target label balance")
    if TARGET_BIN in df.columns:
        report_bin = report_class_balance(df[TARGET_BIN], "[INFO] target numeric balance")

    passed = bool(schema_rep["passed"])
    summary = {
        "dataset": name,
        "rows": int(len(df)),
        "cols": int(df.shape[1]),
        "expected_ok": passed,
        "missing": schema_rep["missing"],
        "extra": schema_rep["extra"],
        "label_balance": report_lbl,
        "numeric_balance": report_bin,
        "passed": passed,
    }

    status = "PASS ✓" if summary["passed"] else "FAIL ✗"
    print(f"[SUMMARY] rows={summary['rows']} cols={summary['cols']} | "
          f"expected_ok={summary['expected_ok']}  ⇒  {status}")
    print(f"=== VERIFICATION END: {name} ===\n")

    if strict and not summary["passed"]:
        raise AssertionError(f"[{name}] Verification failed: {summary}")

    return summary

report = verify_dataset(df, name="EasyVisa Raw", strict=True)
=== VERIFICATION START: EasyVisa Raw ===
[OK] RNG reproducibility
[OK] Target column present
[OK] All expected columns present
[SUMMARY] rows=25480 cols=12 | expected_ok=True  ⇒  PASS ✓
=== VERIFICATION END: EasyVisa Raw ===

In [10]:
# B) CANONICALIZATION 
from typing import Dict, Any, Tuple
import difflib, re

# Helpers
_norm = lambda s: re.sub(r"[^a-z0-9]", "", str(s).lower())

_CANON_MAP: Dict[str,str] = {
    "caseid": "case_id",
    "continent": "continent",
    "educationofemployee": "education_of_employee",
    "hasjobexperience": "has_job_experience",
    "requiresjobtraining": "requires_job_training",
    "noofemployees": "no_of_employees",
    "yrofestab": "yr_of_estab",
    "regionofemployment": "region_of_employment",
    "prevailingwage": "prevailing_wage",
    "unitofwage": "unit_of_wage",
    "fulltimeposition": "full_time_position",
    "casestatus": "case_status",
}

def canon_case_status(s: pd.Series) -> pd.Series:
    s = s.astype("string").str.strip().str.lower()
    m = {
        "certified": "Certified",
        "certified-expired": "Certified",
        "denied": "Denied",
        "rejected": "Denied",
        "withdrawn": "Denied",
    }
    return s.map(m)

def canon_yn(s: pd.Series) -> pd.Series:
    s = s.astype("string").str.strip().str.upper().replace({"YES":"Y","NO":"N"})
    return s.where(s.isin(["Y","N"]), np.nan)

def canon_title(s: pd.Series) -> pd.Series:
    return s.astype("string").str.strip().str.title()

def canon_education(s: pd.Series) -> pd.Series:
    s = s.astype("string").str.strip().str.lower()
    s = s.str.replace("’","'", regex=False).str.replace(r"\s+"," ", regex=True)
    def map_one(x):
        if x is None or x == "nan": return np.nan
        if "high"      in x: return "High School"
        if "bachelor"  in x: return "Bachelor's"
        if "master"    in x: return "Master's"
        if "doctor" in x or "phd" in x: return "Doctorate"
        return "Other"
    return s.map(map_one)

# Internal columns ignore schema warnings 
INTERNAL_COLS = {
    "_case_label","_edu","_continent","_region","_uow",
    "_job_exp","_job_train","_full_time"
}

# Column-level canonicalization 
def canonicalize_columns(df: pd.DataFrame, name: str = "frame") -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Rename raw columns using _CANON_MAP, report unknowns/missing/collisions."""
    raw = list(df.columns)
    keys = [_norm(c) for c in raw]

    # rename with collision protection
    ren, used = {}, set()
    for raw_col, k in zip(raw, keys):
        new = _CANON_MAP.get(k, raw_col)
        if new in used and new != raw_col:
            new = raw_col  # avoid overwrite
        ren[raw_col] = new
        used.add(new)
    out = df.rename(columns=ren).copy()

    # unknown headers skip internals expected names are the canonical keys’ values
    expected = set(_CANON_MAP.values())
    unknown = [(r, k) for r, k in zip(raw, keys)
               if r not in INTERNAL_COLS and not str(r).startswith("_")
               and (k not in _CANON_MAP) and (r not in expected)]

    # missing expected canonical columns
    missing = sorted([c for c in expected if c not in out.columns])

    # collisions two raw normalize to same key
    seen, collisions = {}, {}
    for r, k in zip(raw, keys):
        if k in seen and seen[k] != r:
            collisions.setdefault(k, set()).update([seen[k], r])
        else:
            seen[k] = r

    # print audit
    if unknown:
        print(f"[{name}] UNKNOWN headers:", [r for r,_ in unknown])
        print(f"[{name}] Suggested _CANON_MAP patches:")
        for r,k in unknown:
            guess = difflib.get_close_matches(k, list(_CANON_MAP.keys()), n=1, cutoff=0.6)
            guess_str = guess[0] if guess else "<add-key>"
            print(f'    _CANON_MAP["{k}"] = "{r}"   # raw="{r}"')
    else:
        print(f"[{name}] No unknown headers. ✓")

    print(f"[{name}] MISSING expected canonical columns:", missing if missing else "None ✓")

    if collisions:
        print(f"[{name}] COLLISIONS:")
        for k, raws in collisions.items():
            print(f"    norm='{k}' ← raw {sorted(list(raws))}")
    else:
        print(f"[{name}] No collisions. ✓")

    report = {
        "dataset": name,
        "unknown": [r for r,_ in unknown],
        "unknown_count": len(unknown),
        "missing": missing,
        "missing_count": len(missing),
        "collisions": {k: sorted(list(v)) for k,v in collisions.items()},
        "collision_count": len(collisions),
        "passed": (len(unknown) == 0 and len(missing) == 0 and len(collisions) == 0)
    }
    status = "PASS ✓" if report["passed"] else "WARN ✗"
    print(f"[{name}] COLUMN SUMMARY → unknown={report['unknown_count']} | "
          f"missing={report['missing_count']} | collisions={report['collision_count']} ⇒ {status}")
    return out, report

# Value-level canonicalization 
def apply_canonical_values(df: pd.DataFrame, name: str = "frame") -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """Create canonical helper columns, coerce numerics, and print a validation summary."""
    out = df.copy()

    # target labels and numeric target
    out["_case_label"] = canon_case_status(out["case_status"])
    out["y"] = out["_case_label"].map({"Certified":1, "Denied":0}).astype("Int64")

    # Y/N flags
    out["_job_exp"]   = canon_yn(out["has_job_experience"])
    out["_job_train"] = canon_yn(out["requires_job_training"])
    out["_full_time"] = canon_yn(out["full_time_position"])

    # categoricals
    out["_continent"] = canon_title(out["continent"])
    out["_region"]    = canon_title(out["region_of_employment"])
    out["_uow"]       = canon_title(out["unit_of_wage"])
    out["_edu"]       = canon_education(out["education_of_employee"])

    # numerics keep raw wage normalization 
    out["no_of_employees"] = pd.to_numeric(out["no_of_employees"], errors="coerce")
    out["yr_of_estab"]     = pd.to_numeric(out["yr_of_estab"], errors="coerce")
    out["prevailing_wage"] = pd.to_numeric(out["prevailing_wage"], errors="coerce")

    # Validation report 
    # Validation report 
    rep: Dict[str, Any] = {"dataset": name}

    # target check
    lbl_counts = out["_case_label"].value_counts(dropna=False).to_dict()
    y_counts   = out["y"].value_counts(dropna=False).to_dict()
    rep["target_label_counts"]  = lbl_counts
    rep["target_numeric_counts"] = y_counts
    rep["target_na"] = int(out["_case_label"].isna().sum())

    # Y/N validity
    def yn_invalid(col: str) -> int:
        # True where value is NOT in {Y,N} or is NA
        mask_valid = out[col].isin(["Y","N"])
        invalid = (~mask_valid) | mask_valid.isna()
        return int(invalid.sum())

    rep["job_exp_invalid"]   = yn_invalid("_job_exp")
    rep["job_train_invalid"] = yn_invalid("_job_train")
    rep["full_time_invalid"] = yn_invalid("_full_time")

    # education distribution flag 'Other' share if present
    edu_counts = out["_edu"].value_counts(dropna=False)
    rep["edu_counts"] = edu_counts.to_dict()
    rep["edu_other_pct"] = float((edu_counts.get("Other", 0) / max(1, edu_counts.sum())) * 100)

    # Print concise audit
    print(f"[{name}] TARGET label counts:", lbl_counts)
    print(f"[{name}] TARGET numeric counts:", y_counts)
    print(f"[{name}] TARGET NaNs: {rep['target_na']}")
    print(f"[{name}] YN invalid → job_exp={rep['job_exp_invalid']} | job_train={rep['job_train_invalid']} | full_time={rep['full_time_invalid']}")
    print(f"[{name}] Education 'Other' share: {rep['edu_other_pct']:.1f}%")

    # pass criteria no target NaNs & YN all valid
    rep["passed"] = (rep["target_na"] == 0 and 
                    rep["job_exp_invalid"] == 0 and 
                    rep["job_train_invalid"] == 0 and 
                    rep["full_time_invalid"] == 0)

    status = "PASS ✓" if rep["passed"] else "WARN ✗"
    print(f"[{name}] VALUE SUMMARY ⇒ {status}")
    return out, rep

# One call convenience
def canonicalize_and_validate(df: pd.DataFrame, name: str = "EasyVisa", strict: bool = False) -> Tuple[pd.DataFrame, Dict[str, Any], Dict[str, Any]]:
    """
    Column rename + value canonicalization with full validation.
    Returns (DF, column_report, value_report).
    """
    df1, col_rep = canonicalize_columns(df, name=f"{name}-cols")
    DF, val_rep  = apply_canonical_values(df1, name=f"{name}-vals")

    all_pass = col_rep["passed"] and val_rep["passed"]
    final_status = "PASS ✓" if all_pass else "WARN ✗"
    print(f"[{name}] CANONICALIZATION SUMMARY → columns={col_rep['passed']} & values={val_rep['passed']} ⇒ {final_status}\n")

    if strict and not all_pass:
        raise AssertionError(f"[{name}] Canonicalization failed: col={col_rep}, val={val_rep}")

    return DF, col_rep, val_rep

# Build canonical frame with verification
DF, col_report, val_report = canonicalize_and_validate(df, name="EasyVisa", strict=True)
[EasyVisa-cols] No unknown headers. ✓
[EasyVisa-cols] MISSING expected canonical columns: None ✓
[EasyVisa-cols] No collisions. ✓
[EasyVisa-cols] COLUMN SUMMARY → unknown=0 | missing=0 | collisions=0 ⇒ PASS ✓
[EasyVisa-vals] TARGET label counts: {'Certified': 17018, 'Denied': 8462}
[EasyVisa-vals] TARGET numeric counts: {np.int64(1): 17018, np.int64(0): 8462}
[EasyVisa-vals] TARGET NaNs: 0
[EasyVisa-vals] YN invalid → job_exp=0 | job_train=0 | full_time=0
[EasyVisa-vals] Education 'Other' share: 0.0%
[EasyVisa-vals] VALUE SUMMARY ⇒ PASS ✓
[EasyVisa] CANONICALIZATION SUMMARY → columns=True & values=True ⇒ PASS ✓

In [11]:
# C) RENAME AND SCHEMA GUARD
from typing import Dict, Any
import difflib

# helper/internal columns create during canonicalization
INTERNAL_COLS = {
    "_case_label","_edu","_continent","_region","_uow",
    "_job_exp","_job_train","_full_time"
}

def rename_canonical(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns using _CANON_MAP while avoiding alias collisions."""
    ren, used = {}, set()
    for c in df.columns:
        k = _norm(c)
        new = _CANON_MAP.get(k, c)
        # avoid overwriting when alias and real both exist
        if new in used and new != c:
            new = c
        ren[c] = new
        used.add(new)
    return df.rename(columns=ren).copy()

def schema_guard(df: pd.DataFrame, name: str = "frame", strict: bool = False) -> Dict[str, Any]:
    """
    Validate schema quality:
      - unknown headers (not in alias map or expected), ignoring internal helper cols
      - missing expected columns (after rename)
      - collisions (two raw headers normalize to same key)

    Returns a dict report; raises AssertionError if strict=True and validation fails.
    """
    raw  = list(df.columns)
    keys = [_norm(c) for c in raw]

    # Unknown headers skip internals and underscore-prefixed helpers 
    unknown = [
        (r, k) for r, k in zip(raw, keys)
        if (r not in INTERNAL_COLS)
        and (not str(r).startswith("_"))
        and (k not in _CANON_MAP)
        and (r not in EXPECTED_COLS)
    ]
    if unknown:
        print(f"[{name}] UNKNOWN headers:", [r for r,_ in unknown])
        print(f"[{name}] Suggested _CANON_MAP patches:")
        for r, k in unknown:
            guess = difflib.get_close_matches(k, list(_CANON_MAP.keys()), n=1, cutoff=0.6)
            guess_str = guess[0] if guess else "<add-key>"
            print(f'    _CANON_MAP["{k}"] = "{r}"   # raw="{r}"')
    else:
        print(f"[{name}] No unknown headers. ✓")

    # Missing expected columns after rename
    missing = sorted([c for c in EXPECTED_COLS if c not in df.columns])
    print(f"[{name}] MISSING expected columns:", missing if missing else "None ✓")

    # Collisions two raw → same normalized key
    seen, collisions = {}, {}
    for r, k in zip(raw, keys):
        if k in seen and seen[k] != r:
            collisions.setdefault(k, set()).update([seen[k], r])
        else:
            seen[k] = r
    if collisions:
        print(f"[{name}] COLLISIONS:")
        for k, raws in collisions.items():
            print(f"   norm='{k}' ← raw {sorted(list(raws))}")
    else:
        print(f"[{name}] No collisions. ✓")

    # Summary and machine readable report 
    report = {
        "dataset": name,
        "unknown": [r for r,_ in unknown],
        "unknown_count": len(unknown),
        "missing": missing,
        "missing_count": len(missing),
        "collisions": {k: sorted(list(v)) for k, v in collisions.items()},
        "collision_count": len(collisions),
        "passed": (len(unknown) == 0 and len(missing) == 0 and len(collisions) == 0),
    }
    status = "PASS ✓" if report["passed"] else "FAIL ✗"
    print(f"[{name}] SUMMARY → unknown={report['unknown_count']} | "
          f"missing={report['missing_count']} | collisions={report['collision_count']}  ⇒  {status}")

    if strict and not report["passed"]:
        raise AssertionError(f"[{name}] Schema validation failed: {report}")

    return report

# After loading raw df
df_renamed = rename_canonical(df)

# Print audit and get structured report
rep = schema_guard(df_renamed, name="EasyVisa-cols", strict=True)
[EasyVisa-cols] No unknown headers. ✓
[EasyVisa-cols] MISSING expected columns: None ✓
[EasyVisa-cols] No collisions. ✓
[EasyVisa-cols] SUMMARY → unknown=0 | missing=0 | collisions=0  ⇒  PASS ✓
In [12]:
# D) BUILD CANONICAL DF
from typing import Dict, Any, Tuple

def build_canonical_df(df: pd.DataFrame, name: str = "EasyVisa", strict: bool = False
                      ) -> Tuple[pd.DataFrame, Dict[str, Any]]:
    """
    Rename → canonicalize values → coerce numerics → validate.
    Returns (DF, report). If strict=True, raises on failed validation.
    """
    # 1) Column rename and schema audit
    out = rename_canonical(df).copy()
    col_rep = schema_guard(out, f"{name}-cols", strict=False)

    # 2) Target string and numeric
    out[TARGET_LBL] = canon_case_status(out[TARGET_RAW])
    out[TARGET_BIN] = out[TARGET_LBL].map({"Certified": 1, "Denied": 0}).astype("Int64")

    # 3) Binary flags
    out["_job_exp"]   = canon_yn(out["has_job_experience"])
    out["_job_train"] = canon_yn(out["requires_job_training"])
    out["_full_time"] = canon_yn(out["full_time_position"])

    # 4) Core categoricals
    out["_continent"] = canon_title(out["continent"])
    out["_region"]    = canon_title(out["region_of_employment"])
    out["_uow"]       = canon_title(out["unit_of_wage"])
    out["_edu"]       = canon_education(out["education_of_employee"])

    # 5) Numerics coerce safely wage normalization happens later
    out["no_of_employees"] = pd.to_numeric(out["no_of_employees"], errors="coerce")
    out["yr_of_estab"]     = pd.to_numeric(out["yr_of_estab"], errors="coerce")
    out["prevailing_wage"] = pd.to_numeric(out["prevailing_wage"], errors="coerce")

    # 6) Value level validation
    rep: Dict[str, Any] = {"dataset": name}

    # Target checks
    rep["target_label_counts"]  = out[TARGET_LBL].value_counts(dropna=False).to_dict()
    rep["target_numeric_counts"] = out[TARGET_BIN].value_counts(dropna=False).to_dict()
    rep["target_na"] = int(out[TARGET_LBL].isna().sum())
    rep["y_na"]      = int(out[TARGET_BIN].isna().sum())

    # Y/N validity
    def yn_invalid(col: str) -> int:
        m = out[col].isin(["Y", "N"])
        return int((~m | m.isna()).sum())

    rep["job_exp_invalid"]   = yn_invalid("_job_exp")
    rep["job_train_invalid"] = yn_invalid("_job_train")
    rep["full_time_invalid"] = yn_invalid("_full_time")

    # Numeric NA counts sanity only
    rep["no_of_employees_na"] = int(out["no_of_employees"].isna().sum())
    rep["yr_of_estab_na"]     = int(out["yr_of_estab"].isna().sum())
    rep["prevailing_wage_na"] = int(out["prevailing_wage"].isna().sum())

    # Pass criteria values — tune as needed
    rep["values_passed"] = (
        rep["target_na"] == 0 and rep["y_na"] == 0
        and rep["job_exp_invalid"] == 0
        and rep["job_train_invalid"] == 0
        and rep["full_time_invalid"] == 0
    )

    # 7) Final summary
    overall_pass = col_rep["passed"] and rep["values_passed"]
    rep["columns_passed"] = col_rep["passed"]
    rep["passed"] = overall_pass

    # Print concise summary
    print(f"[{name}-vals] TARGET NaNs: labels={rep['target_na']} | y={rep['y_na']}")
    print(f"[{name}-vals] Y/N invalid → job_exp={rep['job_exp_invalid']} | "
          f"job_train={rep['job_train_invalid']} | full_time={rep['full_time_invalid']}")
    print(f"[{name}-vals] Numeric NaNs → employees={rep['no_of_employees_na']} | "
          f"yr_of_estab={rep['yr_of_estab_na']} | wage={rep['prevailing_wage_na']}")

    status_cols = "PASS ✓" if col_rep["passed"]     else "FAIL ✗"
    status_vals = "PASS ✓" if rep["values_passed"]  else "FAIL ✗"
    status_all  = "PASS ✓" if overall_pass          else "FAIL ✗"
    print(f"[{name}] CANON SUMMARY → columns={status_cols} | values={status_vals}  ⇒  {status_all}\n")

    if strict and not overall_pass:
        raise AssertionError(f"[{name}] Canonicalization failed: columns={col_rep}, values={rep}")

    return out, rep
In [13]:
# ID Freeze Guard
ID_COL = "case_id"

# 1) Normalize & lock dtype
DF[ID_COL] = DF[ID_COL].astype("string").str.strip()

# 2) Basic integrity
assert DF[ID_COL].notna().all(), "Null IDs detected."
dups = DF[ID_COL].value_counts()
assert (dups > 1).sum() == 0, f"Duplicate IDs found: {dups[dups>1].head(10).to_dict()}"

# 3) Deterministic ordering (optional but helpful)
DF = DF.sort_values(ID_COL).reset_index(drop=True)

# 4) Freeze a copy never use as a feature
ID_FREEZE = DF[ID_COL].copy()     # keep alongside DF for merges / OOF / submissions
In [14]:
# Never include ID
DROP = {"case_id", "case_status", "_case_label", "y"}
X = DF.drop(columns=[c for c in DROP if c in DF.columns])  
y = DF["y"].to_numpy()
print(f"[OK] Built X (features) and y (target) with shapes {X.shape}, {y.shape}")
[OK] Built X (features) and y (target) with shapes (25480, 17), (25480,)
In [15]:
# E) RUN & REPORT 
# 1) Core verifications on the raw frame
verify_rng()
verify_target_exists(df)
raw_schema = verify_expected_columns(df)

# 2) Build canonical dataframe and value audit
DF, canon_report = build_canonical_df(df, name="EasyVisa", strict=True)

# 3) Target class balance labels and numeric
lbl_bal = report_class_balance(DF[TARGET_LBL], "[INFO] target label balance")
num_bal = report_class_balance(DF[TARGET_BIN], "[INFO] target numeric balance")

# 4) Quick canonical feature sanity uniques
key_cats = ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"]
uniques = {c: int(DF[c].dropna().nunique()) for c in key_cats}
print("\n[INFO] canonical categorical uniques:", uniques)

# 5) Compact PASS / FAIL footer
overall_pass = bool(raw_schema["passed"] and canon_report["passed"])
status = "PASS ✓" if overall_pass else "WARN ✗"
print("\n================ SUMMARY ================")
print(f"rows={len(DF):,}  cols={DF.shape[1]}  |  "
      f"raw_expected_ok={raw_schema['passed']}  "
      f"columns_passed={canon_report['columns_passed']}  "
      f"values_passed={canon_report['values_passed']}  ⇒  {status}")
print("========================================\n")

# 6) Machine readable bundle 
verification_bundle = {
    "raw_schema": raw_schema,
    "canonicalization": canon_report,
    "label_balance": lbl_bal,
    "numeric_balance": num_bal,
    "cat_uniques": uniques,
    "passed": overall_pass,
}

assert verification_bundle["passed"], f"Verification failed: {verification_bundle}"
[OK] RNG reproducibility
[OK] Target column present
[OK] All expected columns present
[EasyVisa-cols] No unknown headers. ✓
[EasyVisa-cols] MISSING expected columns: None ✓
[EasyVisa-cols] No collisions. ✓
[EasyVisa-cols] SUMMARY → unknown=0 | missing=0 | collisions=0  ⇒  PASS ✓
[EasyVisa-vals] TARGET NaNs: labels=0 | y=0
[EasyVisa-vals] Y/N invalid → job_exp=0 | job_train=0 | full_time=0
[EasyVisa-vals] Numeric NaNs → employees=0 | yr_of_estab=0 | wage=0
[EasyVisa] CANON SUMMARY → columns=PASS ✓ | values=PASS ✓  ⇒  PASS ✓

[INFO] target label balance →
_case_label
Certified    17018
Denied        8462
  Imbalance ratio ≈ 2.01:1
[INFO] target numeric balance →
y
1    17018
0     8462
  Imbalance ratio ≈ 2.01:1

[INFO] canonical categorical uniques: {'_edu': 4, '_continent': 6, '_region': 5, '_uow': 4, '_job_exp': 2, '_job_train': 2, '_full_time': 2}

================ SUMMARY ================
rows=25,480  cols=21  |  raw_expected_ok=True  columns_passed=True  values_passed=True  ⇒  PASS ✓
========================================

Let's check the statistical summary of the data¶

In [16]:
#Print the statistics
DF.describe(include='all').T
Out[16]:
count unique top freq mean std min 25% 50% 75% max
case_id 25480 25480 EZYV8640 1 NaN NaN NaN NaN NaN NaN NaN
continent 25480 6 Asia 16861 NaN NaN NaN NaN NaN NaN NaN
education_of_employee 25480 4 Bachelor's 10234 NaN NaN NaN NaN NaN NaN NaN
has_job_experience 25480 2 Y 14802 NaN NaN NaN NaN NaN NaN NaN
requires_job_training 25480 2 N 22525 NaN NaN NaN NaN NaN NaN NaN
no_of_employees 25480.0 NaN NaN NaN 5667.04321 22877.928848 -26.0 1022.0 2109.0 3504.0 602069.0
yr_of_estab 25480.0 NaN NaN NaN 1979.409929 42.366929 1800.0 1976.0 1997.0 2005.0 2016.0
region_of_employment 25480 5 Northeast 7195 NaN NaN NaN NaN NaN NaN NaN
prevailing_wage 25480.0 NaN NaN NaN 74455.814592 52815.942327 2.1367 34015.48 70308.21 107735.5125 319210.27
unit_of_wage 25480 4 Year 22962 NaN NaN NaN NaN NaN NaN NaN
full_time_position 25480 2 Y 22773 NaN NaN NaN NaN NaN NaN NaN
case_status 25480 2 Certified 17018 NaN NaN NaN NaN NaN NaN NaN
_case_label 25480 2 Certified 17018 NaN NaN NaN NaN NaN NaN NaN
y 25480.0 <NA> <NA> <NA> 0.667896 0.470977 0.0 0.0 1.0 1.0 1.0
_job_exp 25480 2 Y 14802 NaN NaN NaN NaN NaN NaN NaN
_job_train 25480 2 N 22525 NaN NaN NaN NaN NaN NaN NaN
_full_time 25480 2 Y 22773 NaN NaN NaN NaN NaN NaN NaN
_continent 25480 6 Asia 16861 NaN NaN NaN NaN NaN NaN NaN
_region 25480 5 Northeast 7195 NaN NaN NaN NaN NaN NaN NaN
_uow 25480 4 Year 22962 NaN NaN NaN NaN NaN NaN NaN
_edu 25480 4 Bachelor's 10234 NaN NaN NaN NaN NaN NaN NaN
In [17]:
# 1) Total missing values
DF.isna().sum().sort_values(ascending=False)

# 2) Split by type
DF.select_dtypes('number').isna().sum().sort_values(ascending=False)
DF.select_dtypes(exclude='number').isna().sum().sort_values(ascending=False)

# Hard guardrail
assert DF.isna().sum().sum() == 0, "Unexpected NaNs remain after cleaning."
In [18]:
# Numeric summary
num_cols = DF.select_dtypes('number').columns
num_desc = DF[num_cols].describe().T
num_desc['nulls'] = DF[num_cols].isna().sum().values
num_desc['null_pct'] = (DF[num_cols].isna().mean().values * 100).round(2)

# Categorical summary
cat_cols = DF.select_dtypes(exclude='number').columns
cat_desc = DF[cat_cols].describe().T  
cat_desc['nulls'] = DF[cat_cols].isna().sum().values
cat_desc['null_pct'] = (DF[cat_cols].isna().mean().values * 100).round(2)

display(num_desc, cat_desc) 
count mean std min 25% 50% 75% max nulls null_pct
no_of_employees 25480.0 5667.04321 22877.928848 -26.0 1022.0 2109.0 3504.0 602069.0 0 0.0
yr_of_estab 25480.0 1979.409929 42.366929 1800.0 1976.0 1997.0 2005.0 2016.0 0 0.0
prevailing_wage 25480.0 74455.814592 52815.942327 2.1367 34015.48 70308.21 107735.5125 319210.27 0 0.0
y 25480.0 0.667896 0.470977 0.0 0.0 1.0 1.0 1.0 0 0.0
count unique top freq nulls null_pct
case_id 25480 25480 EZYV8640 1 0 0.0
continent 25480 6 Asia 16861 0 0.0
education_of_employee 25480 4 Bachelor's 10234 0 0.0
has_job_experience 25480 2 Y 14802 0 0.0
requires_job_training 25480 2 N 22525 0 0.0
region_of_employment 25480 5 Northeast 7195 0 0.0
unit_of_wage 25480 4 Year 22962 0 0.0
full_time_position 25480 2 Y 22773 0 0.0
case_status 25480 2 Certified 17018 0 0.0
_case_label 25480 2 Certified 17018 0 0.0
_job_exp 25480 2 Y 14802 0 0.0
_job_train 25480 2 N 22525 0 0.0
_full_time 25480 2 Y 22773 0 0.0
_continent 25480 6 Asia 16861 0 0.0
_region 25480 5 Northeast 7195 0 0.0
_uow 25480 4 Year 22962 0 0.0
_edu 25480 4 Bachelor's 10234 0 0.0
In [19]:
# Overall missingness
miss = DF.isna().sum().sort_values(ascending=False)
print(miss[miss > 0].to_string())

# % missing
pct = (DF.isna().mean()*100).sort_values(ascending=False)
print("\n% missing (top 21):\n", pct.head(25).round(2))
Series([], )

% missing (top 21):
 case_id                  0.0
continent                0.0
education_of_employee    0.0
has_job_experience       0.0
requires_job_training    0.0
no_of_employees          0.0
yr_of_estab              0.0
region_of_employment     0.0
prevailing_wage          0.0
unit_of_wage             0.0
full_time_position       0.0
case_status              0.0
_case_label              0.0
y                        0.0
_job_exp                 0.0
_job_train               0.0
_full_time               0.0
_continent               0.0
_region                  0.0
_uow                     0.0
_edu                     0.0
dtype: float64
In [20]:
def norm_str(x):
    return (str(x)
            .replace("\u00a0"," ")
            .strip()
            .lower())

col_raw, col_mapped = "continent", "_continent"   # adjust if different

# 1) What failed to map? (only meaningful if col_mapped exists)
if col_mapped in DF.columns and col_raw in DF.columns:
    failed = (DF.loc[DF[col_mapped].isna(), col_raw]
                .dropna().astype(str).map(norm_str).value_counts())
    print(f"\nUnmapped raw values for {col_raw} -> {col_mapped}:\n", failed.head(50))
else:
    print(f"\n[WARN] Columns not found for diagnostic: {col_raw} / {col_mapped}")

# 2) Compare raw uniques vs *value-map* keys (if you have one)
raw_uniques = set(DF[col_raw].dropna().astype(str).map(norm_str).unique())

# Try to locate a VALUE mapping dict. Update these names to your actual objects:
# - COL_RENAME: raw->canonical column rename dict
# - VAL_MAPS: per-column value maps, e.g., {"continent": {"na": "North America", ...}, ...}
val_map = None
try:
    # preferred: a dedicated value-map container
    val_map = VAL_MAPS.get(col_raw)            # <-- change to your value-map structure name
except NameError:
    pass

if val_map is None:
    # sometimes people store all maps in one big CANON_MAP; try to pull dicts out safely
    try:
        maybe = CANON_MAP.get(col_raw)
        val_map = maybe if isinstance(maybe, dict) else None
    except NameError:
        val_map = None

if isinstance(val_map, dict):
    map_keys = set(norm_str(k) for k in val_map.keys())
    only_in_raw = sorted(raw_uniques - map_keys)
    print(f"\nRaw-only values (not in value map for {col_raw}):", only_in_raw[:50])
else:
    print(f"\n[INFO] No value-mapping dict registered for {col_raw} — "
          f"looks like you only have a column rename. If you expect value normalization, "
          f"define VAL_MAPS['{col_raw}'] = {{...}} and re-run.")
Unmapped raw values for continent -> _continent:
 Series([], Name: count, dtype: int64)

[INFO] No value-mapping dict registered for continent — looks like you only have a column rename. If you expect value normalization, define VAL_MAPS['continent'] = {...} and re-run.
In [21]:
def norm_str(x):
    return (str(x).replace("\u00a0"," ").strip().lower())

def cat_uniques_report(df, cols):
    for c in cols:
        if c in df.columns:
            raw = df[c].dropna().astype(str)
            print(f"\n[{c}] uniques: {raw.nunique()} (raw)")
            print("Top (raw):", raw.value_counts().head(10).to_dict())
            norm = raw.map(norm_str)
            print(f"[{c}] uniques after normalization: {norm.nunique()}")
            print("Top (norm):", norm.value_counts().head(10).to_dict())

cat_cols = [
    "_continent","_region","_uow","_full_time","_job_exp","_job_train","_edu"
]
cat_uniques_report(DF, cat_cols)
[_continent] uniques: 6 (raw)
Top (raw): {'Asia': 16861, 'Europe': 3732, 'North America': 3292, 'South America': 852, 'Africa': 551, 'Oceania': 192}
[_continent] uniques after normalization: 6
Top (norm): {'asia': 16861, 'europe': 3732, 'north america': 3292, 'south america': 852, 'africa': 551, 'oceania': 192}

[_region] uniques: 5 (raw)
Top (raw): {'Northeast': 7195, 'South': 7017, 'West': 6586, 'Midwest': 4307, 'Island': 375}
[_region] uniques after normalization: 5
Top (norm): {'northeast': 7195, 'south': 7017, 'west': 6586, 'midwest': 4307, 'island': 375}

[_uow] uniques: 4 (raw)
Top (raw): {'Year': 22962, 'Hour': 2157, 'Week': 272, 'Month': 89}
[_uow] uniques after normalization: 4
Top (norm): {'year': 22962, 'hour': 2157, 'week': 272, 'month': 89}

[_full_time] uniques: 2 (raw)
Top (raw): {'Y': 22773, 'N': 2707}
[_full_time] uniques after normalization: 2
Top (norm): {'y': 22773, 'n': 2707}

[_job_exp] uniques: 2 (raw)
Top (raw): {'Y': 14802, 'N': 10678}
[_job_exp] uniques after normalization: 2
Top (norm): {'y': 14802, 'n': 10678}

[_job_train] uniques: 2 (raw)
Top (raw): {'N': 22525, 'Y': 2955}
[_job_train] uniques after normalization: 2
Top (norm): {'n': 22525, 'y': 2955}

[_edu] uniques: 4 (raw)
Top (raw): {"Bachelor's": 10234, "Master's": 9634, 'High School': 3420, 'Doctorate': 2192}
[_edu] uniques after normalization: 4
Top (norm): {"bachelor's": 10234, "master's": 9634, 'high school': 3420, 'doctorate': 2192}
In [22]:
na_counts = DF.isna().sum().sort_values(ascending=False)
print(na_counts[na_counts > 0].to_string())
Series([], )
In [23]:
# Check unexpected units
uow_norm = DF["_uow"].astype(str).str.strip().str.lower()
bad_uow = uow_norm[~uow_norm.isin(["year","month","week","biweek","day","hour"])].value_counts()
print("\nUnexpected _uow values:\n", bad_uow.head(30))
Unexpected _uow values:
 Series([], Name: count, dtype: int64)
In [24]:
for c in ["case_status", "_case_label", "y"]:
    assert c in DF.columns, f"Missing: {c}"

print(DF[["_case_label","y"]].isna().sum())
print(DF["_case_label"].value_counts(dropna=False))
print(DF["y"].value_counts(dropna=False))
_case_label    0
y              0
dtype: int64
_case_label
Certified    17018
Denied        8462
Name: count, dtype: int64
y
1    17018
0     8462
Name: count, dtype: Int64
In [25]:
# 
display(DF.select_dtypes(include="number").describe().T)                
display(DF.select_dtypes(include=["object","category","bool"]).describe().T)  
count mean std min 25% 50% 75% max
no_of_employees 25480.0 5667.04321 22877.928848 -26.0 1022.0 2109.0 3504.0 602069.0
yr_of_estab 25480.0 1979.409929 42.366929 1800.0 1976.0 1997.0 2005.0 2016.0
prevailing_wage 25480.0 74455.814592 52815.942327 2.1367 34015.48 70308.21 107735.5125 319210.27
y 25480.0 0.667896 0.470977 0.0 0.0 1.0 1.0 1.0
count unique top freq
case_id 25480 25480 EZYV8640 1
continent 25480 6 Asia 16861
education_of_employee 25480 4 Bachelor's 10234
has_job_experience 25480 2 Y 14802
requires_job_training 25480 2 N 22525
region_of_employment 25480 5 Northeast 7195
unit_of_wage 25480 4 Year 22962
full_time_position 25480 2 Y 22773
case_status 25480 2 Certified 17018
_case_label 25480 2 Certified 17018
_edu 25480 4 Bachelor's 10234
In [26]:
# E) Build canonical frame (idempotent)
try:
    _ = df  # noqa: F401
except NameError:
    # Fallback: load if user re-runs from here
    df = read_csv_safely(data_path)

df_can, audit = build_canonical_df(df, name="EasyVisa", strict=False)
print("[OK] Canonical DF built:", df_can.shape)
print("[OK] Columns:", sorted(df_can.columns)[:12], "…")
verify_target_exists(df_can)
report_class_balance(df_can[TARGET_BIN], title="[INFO] Class balance (y)")
[EasyVisa-cols] No unknown headers. ✓
[EasyVisa-cols] MISSING expected columns: None ✓
[EasyVisa-cols] No collisions. ✓
[EasyVisa-cols] SUMMARY → unknown=0 | missing=0 | collisions=0  ⇒  PASS ✓
[EasyVisa-vals] TARGET NaNs: labels=0 | y=0
[EasyVisa-vals] Y/N invalid → job_exp=0 | job_train=0 | full_time=0
[EasyVisa-vals] Numeric NaNs → employees=0 | yr_of_estab=0 | wage=0
[EasyVisa] CANON SUMMARY → columns=PASS ✓ | values=PASS ✓  ⇒  PASS ✓

[OK] Canonical DF built: (25480, 21)
[OK] Columns: ['_case_label', '_continent', '_edu', '_full_time', '_job_exp', '_job_train', '_region', '_uow', 'case_id', 'case_status', 'continent', 'education_of_employee'] …
[OK] Target column present
[INFO] Class balance (y) →
y
1    17018
0     8462
  Imbalance ratio ≈ 2.01:1
Out[26]:
{'counts': {np.int64(1): 17018, np.int64(0): 8462}, 'ratio': 2.011108484991728}

Fixing the negative values in number of employees columns¶

In [27]:
# Clean negatives in no_of_employees on the canonical dataframe (df_can)
col = "no_of_employees"

# Work on a copy to avoid chained assignment surprises; reassign to df_can at the end
_df = df_can.copy()

# Coerce to numeric (in case of stray strings); invalid -> NaN
_df[col] = pd.to_numeric(_df[col], errors="coerce")

# Count negatives (ignore NaN)
neg_mask = _df[col] < 0
neg_count = int(neg_mask.sum())
print(f"[INFO] Found {neg_count} negative values in {col}")

# Compute mean of valid (non-negative, non-null) values
valid_mask = _df[col].ge(0) & _df[col].notna()
valid_mean = _df.loc[valid_mask, col].mean()

if np.isfinite(valid_mean):
    fill_value = int(round(valid_mean))
    print(f"[OK] Mean of valid {col} = {valid_mean:.2f} -> using {fill_value} for replacement")
    _df.loc[neg_mask, col] = fill_value
    print("[DONE] Negative values replaced with mean-rounded")
else:
    print(f"[WARN] No valid non-negative {col} values to compute mean. No replacements made.")

# Reassign back
df_can = _df
print(df_can[col].describe())
[INFO] Found 33 negative values in no_of_employees
[OK] Mean of valid no_of_employees = 5674.42 -> using 5674 for replacement
[DONE] Negative values replaced with mean-rounded
count     25480.000000
mean       5674.414796
std       22877.012865
min          12.000000
25%        1028.000000
50%        2114.000000
75%        3515.000000
max      602069.000000
Name: no_of_employees, dtype: float64

Let's check the count of each unique category in each of the categorical variables¶

In [28]:
# Unique values per column
df.nunique()
Out[28]:
case_id                  25480
continent                    6
education_of_employee        4
has_job_experience           2
requires_job_training        2
no_of_employees           7105
yr_of_estab                199
region_of_employment         5
prevailing_wage          25454
unit_of_wage                 4
full_time_position           2
case_status                  2
dtype: int64

Univariate Analysis¶

In [29]:
# Univariate Analysis Function (updated for canonical df_can)

def univariate_box_hist(
    data: pd.DataFrame,
    feature: str,
    kde: bool = False,
    bins: int | None = None,
    figsize: tuple = (12, 6),
    color: str = "#4C72B0",
    log_transform: bool = False
):
    # Prepare data
    x = pd.to_numeric(data[feature], errors="coerce").dropna()
    if log_transform:
        # Ensure non-negative before log1p (robust to any residual negatives)
        x = np.log1p(np.clip(x, a_min=0, a_max=None))

    if x.empty:
        print(f"[WARN] No numeric data available for '{feature}' after coercion/dropna.")
        return None, (None, None)

    # Create figure with two stacked subplots
    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )

    # ---- Boxplot ----
    sns.boxplot(
        x=x,
        ax=ax_box,
        color=color,
        showmeans=True,
        meanprops={"marker": "D", "markerfacecolor": "white", "markeredgecolor": "black"},
        flierprops={"marker": "o", "markersize": 4, "alpha": 0.5},
    )
    ax_box.set_title(f"{feature} — Boxplot & Histogram", fontsize=13, weight="bold", pad=10)
    ax_box.set_xlabel("")
    ax_box.grid(True, axis="x", linestyle="--", alpha=0.4)
    ax_box.set_yticks([])

    # ---- Histogram ----
    sns.histplot(
        x=x,
        kde=kde,
        bins=bins,
        ax=ax_hist,
        color=color,
        edgecolor="white",
        alpha=0.8,
    )

    # Add mean & median lines
    mean, median = float(np.mean(x)), float(np.median(x))
    ax_hist.axvline(mean, color="red", linestyle="--", linewidth=1.2, label=f"Mean = {mean:,.1f}")
    ax_hist.axvline(median, color="black", linestyle="-", linewidth=1.2, label=f"Median = {median:,.1f}")

    # Beautify
    ax_hist.legend()
    ax_hist.grid(True, axis="y", linestyle="--", alpha=0.4)
    ax_hist.set_xlabel(feature + (" (log1p)" if log_transform else ""))
    ax_hist.set_ylabel("Count")
    plt.tight_layout()
    plt.show()

    return fig, (ax_box, ax_hist)

# === Usage with canonical dataframe ===
univariate_box_hist(df_can, "no_of_employees", kde=True, bins=40)
print("\nlog1p transformed")
univariate_box_hist(df_can, "no_of_employees", kde=True, bins=40, log_transform=True)
No description has been provided for this image
log1p transformed
No description has been provided for this image
Out[29]:
(<Figure size 1200x600 with 2 Axes>,
 (<Axes: title={'center': 'no_of_employees — Boxplot & Histogram'}>,
  <Axes: xlabel='no_of_employees (log1p)', ylabel='Count'>))
In [30]:
# Bar graphs with labels (robust percent formatting; no FuncFormatter dependency)

import matplotlib.ticker as mticker 

def labeled_barplot(data: pd.DataFrame, feature: str, perc: bool = False, n: int | None = None):
    # 1) Prepare counts (robust to NaNs and mixed dtypes)
    s = data[feature].astype("string").fillna("NaN")
    vc = s.value_counts(dropna=False)

    if n is not None and n > 0 and len(vc) > n:
        head = vc.iloc[:n]
        tail_sum = int(vc.iloc[n:].sum())
        vc = pd.concat([head, pd.Series({"Other": tail_sum})])

    total = int(vc.sum())
    yvals = (vc / total * 100.0) if perc else vc

    # 2) Figure size scales with number of bars
    k = len(vc)
    fig_w = max(6.5, 0.6 * k)
    fig, ax = plt.subplots(figsize=(fig_w, 5))

    # 3) Plot
    bars = ax.bar(vc.index.astype(str), yvals.values, edgecolor="white")

    # 4) Labels on bars
    for bar, raw in zip(bars, vc.values):
        h = bar.get_height()
        label = f"{(raw/total*100):.1f}%" if perc else f"{int(raw):,}"
        ax.annotate(
            label,
            (bar.get_x() + bar.get_width()/2, h),
            ha="center", va="bottom",
            xytext=(0, 3), textcoords="offset points",
            fontsize=10
        )

    # 5) Cosmetics
    ax.set_ylabel("Percentage" if perc else "Count")
    ax.set_title(f"{feature} — distribution", weight="bold")
    ax.grid(True, axis="y", alpha=0.25)
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right")

    if perc:
        ax.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=100))  # <-- fix

    plt.tight_layout()
    return fig, ax

# === Usage with canonical dataframe ===
labeled_barplot(df_can, "_region")             # counts with labels
labeled_barplot(df_can, "_edu", perc=True, n=6)  # percentages (top 6 + 'Other')
Out[30]:
(<Figure size 650x500 with 1 Axes>,
 <Axes: title={'center': '_edu — distribution'}, ylabel='Percentage'>)
No description has been provided for this image
No description has been provided for this image

Observations on region of employment¶

  • The dataset shows that most visa instances are concentrated in the Northeast and South regions that combined account for the largest share of all instances

  • The West and Midwest regions have moderate representation indicating a balanced distribution across central and western states

  • The Island region contributes a very small percentage of cases suggesting small labor demand or limited reporting

Observations on education of employee¶

  • Most of the instances fall within the Bachelor’s and Master’s degree categories, with only a small percentage at the Doctorate level.

Observations on job experience¶

In [31]:
# Job experience distribution
labeled_barplot(df, 
                "has_job_experience", 
                perc=True
                ) 
Out[31]:
(<Figure size 650x500 with 1 Axes>,
 <Axes: title={'center': 'has_job_experience — distribution'}, ylabel='Percentage'>)
No description has been provided for this image
  • The distribution shows that approx. 58% of instances have had prior jobs while 42% have not this can be one-hot encoded to 1 and 0

Observations on case status¶

In [32]:
df["_case_label"] = df["case_status"].str.strip().str.title()
labeled_barplot(df, 
                "_case_label", 
                perc=True
                )
Out[32]:
(<Figure size 650x500 with 1 Axes>,
 <Axes: title={'center': '_case_label — distribution'}, ylabel='Percentage'>)
No description has been provided for this image
  • There is an imbalance of two to one with approx. 33% of the instances in the Denied category and approx. 66% in the Certified category.

Bivariate Analysis¶

Creating functions that will help us with further analysis.

In [33]:
# Bivariate Analysis
def _fd_bins(x: np.ndarray, min_bins: int = 20, max_bins: int = 120) -> int:
    """Freedman–Diaconis rule with sensible caps."""
    x = x[~np.isnan(x)]
    if x.size < 2:
        return min_bins
    q25, q75 = np.percentile(x, [25, 75])
    iqr = q75 - q25
    if iqr <= 0:
        return min_bins
    h = 2 * iqr * (x.size ** (-1/3))
    if h <= 0:
        return min_bins
    k = int(np.ceil((x.max() - x.min()) / h))
    return max(min_bins, min(max_bins, k))

def bivariate_with_target(
    data: pd.DataFrame,
    predictor: str,
    target: str,
    *,
    bins: int | str | np.ndarray | None = None,
    log_x: bool = False,
    colors: tuple[str, str] = ("#2E8B57", "#B22222") 
):
    # Prepare data
    df = data[[predictor, target]].copy().dropna(subset=[predictor, target])
    df[target] = df[target].astype(str).str.strip().str.title()

    classes = df[target].unique()
    if len(classes) != 2:
        raise ValueError(f"{target!r} must be binary; found {len(classes)} classes: {sorted(classes)}")

    # Prefer deterministic order Certified/Denied when present; else alphabetical
    cls_set = set(classes.tolist())
    if {"Certified", "Denied"}.issubset(cls_set):
        pos, neg = "Certified", "Denied"
    else:
        pos, neg = sorted(classes)

    pal = colors if len(colors) == 2 else ("#4C72B0", "#DD8452")

    # Numeric vector, optional log
    x = pd.to_numeric(df[predictor], errors="coerce")
    df = df.loc[x.notna()].copy()
    x = x.loc[df.index]

    if log_x:
        x = np.log1p(np.clip(x, a_min=0, a_max=None))
        df[predictor] = x
        xlabel = f"log1p({predictor})"
    else:
        df[predictor] = x
        xlabel = predictor

    # Decide bins if None
    if bins is None:
        bins = _fd_bins(df[predictor].to_numpy())

    fig, axs = plt.subplots(2, 2, figsize=(12, 8), constrained_layout=True)

    # KDE hist – positive class
    sns.histplot(
        data=df[df[target] == pos],
        x=predictor,
        bins=bins,
        kde=True,
        stat="density",
        ax=axs[0, 0],
        color=pal[0],
        edgecolor="white",
        alpha=0.85
    )
    axs[0, 0].set_title(f"{predictor} | {target} = {pos}", weight="bold")
    axs[0, 0].set_xlabel(xlabel)
    axs[0, 0].grid(axis="y", alpha=.25)

    # KDE hist – negative class
    sns.histplot(
        data=df[df[target] == neg],
        x=predictor,
        bins=bins,
        kde=True,
        stat="density",
        ax=axs[0, 1],
        color=pal[1],
        edgecolor="white",
        alpha=0.85
    )
    axs[0, 1].set_title(f"{predictor} | {target} = {neg}", weight="bold")
    axs[0, 1].set_xlabel(xlabel)
    axs[0, 1].grid(axis="y", alpha=.25)

    # Boxplot with outliers
    sns.boxplot(
        data=df, x=target, y=predictor,
        hue=target, dodge=False, palette=pal, ax=axs[1, 0],
    )
    if axs[1, 0].legend_ is not None:
        axs[1, 0].legend_.remove()
    axs[1, 0].set_title("Boxplot w.r.t. target", weight="bold")
    axs[1, 0].grid(axis="y", alpha=.25)

    # Boxplot WITHOUT outliers
    sns.boxplot(
        data=df, x=target, y=predictor,
        hue=target, dodge=False, palette=pal, showfliers=False, ax=axs[1, 1],
    )
    if axs[1, 1].legend_ is not None:
        axs[1, 1].legend_.remove()
    axs[1, 1].set_title("Boxplot (no outliers) w.r.t. target", weight="bold")
    axs[1, 1].grid(axis="y", alpha=.25)

    sns.set_style("whitegrid")
    sns.set_context("notebook", font_scale=1.1)
    plt.show()
    return fig, axs

bivariate_with_target(
    df_can, predictor="no_of_employees", target=TARGET_LBL,
    log_x=True, bins=60, colors=("#007ACC", "#FF6600")
)
No description has been provided for this image
Out[33]:
(<Figure size 1200x800 with 4 Axes>,
 array([[<Axes: title={'center': 'no_of_employees | _case_label = Certified'}, xlabel='log1p(no_of_employees)', ylabel='Density'>,
         <Axes: title={'center': 'no_of_employees | _case_label = Denied'}, xlabel='log1p(no_of_employees)', ylabel='Density'>],
        [<Axes: title={'center': 'Boxplot w.r.t. target'}, xlabel='_case_label', ylabel='no_of_employees'>,
         <Axes: title={'center': 'Boxplot (no outliers) w.r.t. target'}, xlabel='_case_label', ylabel='no_of_employees'>]],
       dtype=object))
In [34]:
def stacked_barplot(data: pd.DataFrame, predictor: str, target: str):
    """
    Create and print category counts, then plot a 100% stacked bar chart.

    Parameters
    ----------
    data : DataFrame
        The dataframe containing predictor and target columns.
    predictor : str
        Independent (categorical) variable.
    target : str
        Target (binary/categorical) variable.
    """

    # sanity check
    if predictor not in data.columns or target not in data.columns:
        raise KeyError(f"Columns '{predictor}' or '{target}' not found in dataframe.")

    # cross-tabulations
    raw_tab = pd.crosstab(data[predictor], data[target], margins=True)
    print(raw_tab)
    print("-" * 120)

    norm_tab = pd.crosstab(data[predictor], data[target], normalize="index") * 100

    # Sort by the proportion of the minority/second class for consistency
    if norm_tab.shape[1] > 1:
        # choose the smallest or last column for sorting
        sort_col = norm_tab.columns[-1]
        norm_tab = norm_tab.sort_values(by=sort_col, ascending=False)

    # Plot
    fig, ax = plt.subplots(figsize=(max(8, len(norm_tab)*0.7), 5))
    norm_tab.plot(kind="bar", stacked=True, ax=ax, color=["#007ACC", "#FF6600"])
    sns.set_style("whitegrid")
    sns.set_context("notebook", font_scale=1.1)

    # label formatting
    ax.set_ylabel("Percentage")
    ax.set_xlabel(predictor)
    ax.set_title(f"{predictor} vs {target} — Class Composition (100%)", fontweight="bold")
    ax.legend(title=target, bbox_to_anchor=(1.02, 1), loc="upper left", frameon=False)
    ax.grid(axis="y", alpha=0.25)
    plt.tight_layout()
    plt.tick_params(axis='x', rotation=45)
    plt.show()

stacked_barplot(df, "education_of_employee", "_case_label")
stacked_barplot(df, "region_of_employment", "_case_label")
stacked_barplot(df, "has_job_experience", "_case_label")
_case_label            Certified  Denied    All
education_of_employee                          
Bachelor's                  6367    3867  10234
Doctorate                   1912     280   2192
High School                 1164    2256   3420
Master's                    7575    2059   9634
All                        17018    8462  25480
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
_case_label           Certified  Denied    All
region_of_employment                          
Island                      226     149    375
Midwest                    3253    1054   4307
Northeast                  4526    2669   7195
South                      4913    2104   7017
West                       4100    2486   6586
All                       17018    8462  25480
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image
_case_label         Certified  Denied    All
has_job_experience                          
N                        5994    4684  10678
Y                       11024    3778  14802
All                     17018    8462  25480
------------------------------------------------------------------------------------------------------------------------
No description has been provided for this image

Does higher education increase the chances of visa certification for well-paid jobs abroad?¶

In [35]:
# Higher education vs visa certification

# Use cleaned education + canonical target label
edu_series = (df_can["_edu"] if "_edu" in df_can.columns else df_can["education_of_employee"]).astype("string")
target_col = TARGET_LBL if "TARGET_LBL" in globals() else "_case_label"
y = (df_can[target_col].astype("string").str.strip().str.title() == "Certified").astype(int)

# Natural order
order = ["High School", "Bachelor's", "Master's", "Doctorate"]

# Compute rate, n, and 95% CI
stats = (
    pd.DataFrame({"edu": edu_series, "y": y})
      .dropna(subset=["edu"])
      .groupby("edu")
      .agg(p=("y", "mean"), n=("y", "size"))
      .reindex(order)
)

se = np.sqrt(stats["p"] * (1 - stats["p"]) / stats["n"])
ci = 1.96 * se

# Drop categories with n == 0 or NaN before plotting
plot_stats = stats.copy()
plot_stats["ci"] = ci
plot_stats = plot_stats.loc[plot_stats["n"].fillna(0) > 0]

# Plot
fig, ax = plt.subplots(figsize=(7, 4))
ax.errorbar(
    plot_stats["p"].values,
    np.arange(len(plot_stats)),
    xerr=plot_stats["ci"].values,
    fmt="o", capsize=4,
    color="#007ACC", ecolor="#005A99",
    elinewidth=1.5, markeredgewidth=1.5
)
ax.set_yticks(np.arange(len(plot_stats)))
ax.set_yticklabels(plot_stats.index, fontsize=10)
ax.set_xlabel("Certification Rate", fontsize=11)
ax.set_title("Visa Approval Rate by Education (±95% CI)", fontsize=12, weight="bold")
ax.set_xlim(0, 1)
ax.grid(axis="x", alpha=0.25)
sns.despine()
plt.tight_layout()
plt.show()

# Build summary table: rate, n, 95% CI
_overall = y.mean()
def _wilson_ci(k, n, z=1.96):
    if n == 0: return (np.nan, np.nan)
    p = k / n
    denom = 1 + (z**2)/n
    center = (p + (z**2)/(2*n)) / denom
    half = z * np.sqrt((p*(1-p) + (z**2)/(4*n)) / n) / denom
    return center - half, center + half

k = (pd.DataFrame({"edu": edu_series, "y": y})
       .dropna(subset=["edu"])
       .groupby("edu")["y"].sum()
       .reindex(order))

wil = [_wilson_ci(int(k.iloc[i]), int(stats["n"].iloc[i])) for i in range(len(stats))]
ci_low  = [w[0] for w in wil]
ci_high = [w[1] for w in wil]

rate_df = (
    pd.DataFrame({
        "rate":   stats["p"],
        "n":      stats["n"].astype("Int64"),
        "ci_low": ci_low,
        "ci_high": ci_high,
        "rate_pct": (100*stats["p"]).round(1),
        "lift_pp": ((stats["p"] - _overall) * 100).round(1),
    })
    .dropna(subset=["n"])
    .round(4)
)
rate_df.index.name = "education"
rate_df
No description has been provided for this image
Out[35]:
rate n ci_low ci_high rate_pct lift_pp
education
High School 0.3404 3420 0.3247 0.3564 34.0 -32.8
Bachelor's 0.6221 10234 0.6127 0.6315 62.2 -4.6
Master's 0.7863 9634 0.7780 0.7943 78.6 11.8
Doctorate 0.8723 2192 0.8576 0.8856 87.2 20.4

Yes, combined with the previously observed baseline of 66.8% the findings show:

  • High school has only a ~34% signal with -32.77pp

  • Bachelors shows the signal is ~62% with -4.8pp

  • Masters delivers the first signal lift at ~79% with 12.2pp

  • Doctorate is the strongest signal lift at ~87% with 20.43pp

This indicates stronger rates of visa certification with every progression of education from high school in steps of 3x, 7x, and 13x respectively

How does visa status vary across different continents?¶

In [36]:
# Pretty categorical bivariate for DF + TARGET_LBL 
import matplotlib.ticker as mticker

def bivariate_cat_pretty(
    data: pd.DataFrame,
    predictor: str,
    target: str,
    *,
    colors=("#2B83F6", "#FF8C42"),
    order_by="rate",
    ascending=True,
    return_stats=False,
):
    # Checks
    assert predictor in data.columns, f"Missing predictor: {predictor}"
    assert target in data.columns,    f"Missing target: {target}"

    df = data[[predictor, target]].dropna()
    y = df[target].astype(str).str.strip().str.title()
    classes = y.unique()
    assert len(classes) == 2, f"Target must be binary; got {sorted(map(str, classes))}"
    pos, neg = ("Certified", "Denied") if {"Certified","Denied"}.issubset(set(classes)) else tuple(sorted(classes))

    x = df[predictor].astype("string").str.strip().replace("", "NaN")
    ctab = pd.crosstab(x, y)
    for c in (pos, neg):
        if c not in ctab.columns: ctab[c] = 0
    ctab = ctab[[pos, neg]]

    n = ctab.sum(axis=1)
    p = (ctab[pos] / n).fillna(0.0)
    se = np.sqrt(p * (1 - p) / n.replace(0, np.nan))
    ci = (1.96 * se).fillna(0.0)

    # ordering
    if isinstance(order_by, (list, tuple, pd.Index)):
        order = list(order_by)
    elif order_by == "alpha":
        order = sorted(ctab.index.astype(str))
    elif order_by == "count":
        order = n.sort_values(ascending=ascending).index.tolist()
    else:  # "rate"
        order = p.sort_values(ascending=ascending).index.tolist()

    ctab, n, p, ci = ctab.reindex(order), n.reindex(order), p.reindex(order), ci.reindex(order)
    perc = (ctab.div(n.replace(0, np.nan), axis=0) * 100).fillna(0)

    # Plot
    pretty_pred = predictor.lstrip("_").replace("_", " ").title()
    fig_w = max(8, 0.9 * len(order))
    fig, (ax0, ax1) = plt.subplots(2, 1, figsize=(fig_w, 7), gridspec_kw={"height_ratios":[3,2]}, constrained_layout=True)
    sns.set_style("whitegrid"); sns.set_context("notebook", font_scale=1.05)

    # Top: 100% stacked with n labels
    bottom = np.zeros(len(order))
    for cls, col in [(pos, colors[0]), (neg, colors[1])]:
        vals = perc[cls].to_numpy()
        ax0.bar(perc.index.astype(str), vals, bottom=bottom, label=cls, color=col, edgecolor="white", alpha=.9)
        bottom += vals
    ax0.set_ylim(0, 102)
    for i, n_i in enumerate(n.to_numpy()):
        ax0.annotate(f"n={int(n_i)}", (i, 101), ha="center", va="bottom", fontsize=9, clip_on=False)
    ax0.set_ylabel("Class mix (%)")
    ax0.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=100))
    ax0.set_title(f"{pretty_pred} — class composition", weight="bold", pad=18)  
    ax0.legend(title=target, bbox_to_anchor=(1.02, 1), loc="upper left")
    ax0.tick_params(axis="x", rotation=45)
    ax0.grid(axis="y", alpha=.25)

    # Bottom: approval rate ±95% CI sorted
    y_pos = np.arange(len(p))
    ax1.errorbar(p.values, y_pos, xerr=ci.values, fmt="o", ms=6,
                 mfc="white", mec="#1f77b4", ecolor="#1f77b4", capsize=4, lw=1.5)
    ax1.set_yticks(y_pos); ax1.set_yticklabels(p.index.astype(str))
    ax1.set_xlim(0, 1); ax1.set_xlabel("P(Certified)")
    ax1.set_title(f"Approval rate by {pretty_pred} (±95% CI)", weight="bold")
    ax1.grid(axis="x", alpha=.25)

    sns.despine()
    if return_stats:
        stats = pd.DataFrame({"n": n.astype(int), "p_certified": p, "ci_low": (p-ci).clip(0,1), "ci_high": (p+ci).clip(0,1)})
        return fig, (ax0, ax1), stats
    return fig, (ax0, ax1)

# CONTINENT ONLY
fig, (ax0, ax1) = bivariate_cat_pretty(
    DF, predictor="_continent", target=TARGET_LBL,
    colors=("#2B83F6", "#FF8C42"),
    order_by="rate", ascending=True
)
plt.show()

# Baseline overall certification rate
_overall = (DF[TARGET_LBL].astype(str).str.strip().str.title() == "Certified").mean()

# Wilson 95% CI for a proportion
def _wilson_ci(k, n, z=1.96):
    if n == 0:
        return (np.nan, np.nan)
    p = k / n
    denom = 1 + (z**2)/n
    center = (p + (z**2)/(2*n)) / denom
    half = z * np.sqrt((p*(1-p) + (z**2)/(4*n)) / n) / denom
    return center - half, center + half

# Continent table counts, rate, 95% CI, lift vs overall
_tmp = DF[["continent", TARGET_LBL]].dropna()
_y = (_tmp[TARGET_LBL].astype(str).str.strip().str.title() == "Certified").astype(int)

_g = (
    pd.concat([_tmp[["continent"]].reset_index(drop=True), _y.rename("y")], axis=1)
      .groupby("continent", dropna=False)["y"]
      .agg(n="count", certified="sum", rate="mean")
      .reset_index()
)

_cis = _g.apply(lambda r: _wilson_ci(r["certified"], r["n"]), axis=1)
_g["ci_low"]   = [c[0] for c in _cis]
_g["ci_high"]  = [c[1] for c in _cis]
_g["rate_pct"] = (100 * _g["rate"]).round(1)
_g["lift_pp"]  = ((_g["rate"] - _overall) * 100).round(1)

continent_tbl = _g[["continent","n","rate","rate_pct","ci_low","ci_high","lift_pp"]] \
                   .sort_values("rate", ascending=True)

display(continent_tbl)
continent_tbl.to_csv("continent_approval_table.csv", index=False) 
No description has been provided for this image
continent n rate rate_pct ci_low ci_high lift_pp
5 South America 852 0.578638 57.9 0.545202 0.611369 -8.9
3 North America 3292 0.618773 61.9 0.602052 0.635217 -4.9
4 Oceania 192 0.635417 63.5 0.565297 0.700224 -3.2
1 Asia 16861 0.653105 65.3 0.645886 0.660254 -1.5
0 Africa 551 0.720508 72.1 0.681610 0.756353 5.3
2 Europe 3732 0.792337 79.2 0.779025 0.805047 12.4

Yes, combined with the previously observed baseline of 66.8% the findings show:

  • Europe had the highest signal lift at ~79.2% with +12.4pp

  • Africa came in with a lift signal of ~72.1% with +5.3pp

  • Asia the signal is ~65.3% with −1.5pp

  • Oceania the signal is ~63.5% with −3.2pp

  • North America the signal is ~61.9% with −4.9pp

  • South America the signal is ~57.9% with −8.9pp

This indicates continent-level differences in certification, with Europe showing the strongest positive lift and South America the lowest

Does having prior work experience influence the chances of visa certification for career opportunities abroad?¶

In [37]:
# Approval rate by job experience

assert TARGET_LBL in DF.columns, f"Missing target: {TARGET_LBL}"
exp_col = "_job_exp" if "_job_exp" in DF.columns else "has_job_experience"
assert exp_col in DF.columns, f"Missing experience column: {exp_col}"

# Map Y/N -> Yes/No keep Unknown, and set a stable categorical order
exp = (DF[exp_col].astype("string").str.strip().str.upper()
         .map({"Y": "Yes", "N": "No"})
         .fillna("Unknown")
         .astype("category")
         .cat.set_categories(["Yes", "No", "Unknown"], ordered=True))

# Binary target
y = (DF[TARGET_LBL].astype("string").str.strip().str.title().eq("Certified")).astype("int8")

# Group with explicit observed=True and get rate + n
grp = (pd.DataFrame({"exp": exp, "y": y})
         .groupby("exp", observed=True)
         .agg(rate=("y", "mean"), n=("y", "size")))

# 95% CI avoid divide-by-zero
se = np.sqrt(grp["rate"] * (1 - grp["rate"]) / grp["n"].replace(0, np.nan))
ci = 1.96 * se
grp["ci"] = ci

# Plot and drop empty categories to avoid blank bars
plot_df = grp.loc[grp["n"] > 0]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(plot_df.index.astype(str), plot_df["rate"].values, edgecolor="white")
ax.errorbar(plot_df.index.astype(str), plot_df["rate"].values,
            yerr=plot_df["ci"].values, fmt="none", capsize=3, color="k")
ax.set_ylim(0, 1)
ax.set_ylabel("Certification rate")
ax.set_title("Approval rate by job experience (±95% CI)")
ax.grid(axis="y", alpha=.25)
plt.tight_layout()
plt.show()

_overall = y.mean()

def _wilson_ci(k, n, z=1.96):
    if n == 0:
        return (np.nan, np.nan)
    p = k / n
    denom = 1 + (z**2)/n
    center = (p + (z**2)/(2*n)) / denom
    half = z * np.sqrt((p*(1-p) + (z**2)/(4*n)) / n) / denom
    return center - half, center + half

order = ["Yes", "No", "Unknown"]
n_series = grp["n"].reindex(order).astype("Int64")
rate_series = grp["rate"].reindex(order)
k_series = (
    pd.DataFrame({"exp": exp, "y": y})
      .groupby("exp", observed=True)["y"]
      .sum()
      .reindex(order)
      .fillna(0)
      .astype(int)
)

cis = [ _wilson_ci(int(k_series.iloc[i]), int(n_series.iloc[i] if pd.notna(n_series.iloc[i]) else 0))
        for i in range(len(order)) ]
ci_low  = [c[0] for c in cis]
ci_high = [c[1] for c in cis]

exp_tbl = (
    pd.DataFrame({
        "rate": rate_series,
        "n": n_series,
        "ci_low": ci_low,
        "ci_high": ci_high,
        "rate_pct": (100*rate_series).round(1),
        "lift_pp": ((rate_series - _overall)*100).round(1),
    })
    .dropna(subset=["n"])
    .round(4)
)
exp_tbl.index.name = "job_experience"
exp_tbl
No description has been provided for this image
Out[37]:
rate n ci_low ci_high rate_pct lift_pp
job_experience
Yes 0.7448 14802 0.7377 0.7517 74.5 7.7
No 0.5613 10678 0.5519 0.5707 56.1 -10.7

Yes, combined with the previously observed baseline of 66.8% the findings show:

  • Yes (prior experience) the signal lift is ~74.5% with +7.7pp

  • No (no experience) the signal is ~56.1% with −10.7pp

This indicates a strong positive effect of prior work experience on certification likelihood

Is the prevailing wage consistent across all regions of the US?¶

In [38]:
# Prevailing wage by region 
import numpy as np, pandas as pd, matplotlib.pyplot as plt, seaborn as sns
import scipy.stats as st

# Columns
region_col = "_region" if "_region" in DF.columns else "region_of_employment"
assert region_col in DF.columns, f"Missing region column: {region_col}"

if "wage_usd_annual" in DF.columns:
    tmp = DF[[region_col, "wage_usd_annual"]].copy()
else:
    # Compute annual wage from prevailing_wage + unit of wage
    assert "prevailing_wage" in DF.columns, "Need 'prevailing_wage' to compute annual wage."
    uow_col = "_uow" if "_uow" in DF.columns else ("unit_of_wage" if "unit_of_wage" in DF.columns else None)
    assert uow_col is not None, "Need wage unit column '_uow' or 'unit_of_wage'."

    _factor = {
        "Hour": 2080.0, "Week": 52.0, "Month": 12.0, "Year": 1.0, "Yr": 1.0,
        "Annual": 1.0, "Per Year": 1.0, "Bi-Week": 26.0, "Biweekly": 26.0, "Day": 260.0,
    }
    tmp = DF[[region_col, "prevailing_wage", uow_col]].copy()
    tmp["wage_usd_annual"] = (
        pd.to_numeric(tmp["prevailing_wage"], errors="coerce")
        * tmp[uow_col].astype("string").str.strip().str.title().map(_factor)
    )

# Tidy & log transform
tmp = tmp.dropna(subset=[region_col, "wage_usd_annual"])
tmp[region_col] = tmp[region_col].astype("string").str.strip()
tmp["log_wage"] = np.log1p(tmp["wage_usd_annual"].clip(lower=0))

# Order regions by median wage for a neat plot
order = (tmp.groupby(region_col, observed=True)["wage_usd_annual"]
           .median().sort_values().index.tolist())
tmp[region_col] = pd.Categorical(tmp[region_col], categories=order, ordered=True)

# Boxplot
plt.figure(figsize=(8, 4.5))
sns.boxplot(data=tmp, x=region_col, y="log_wage")
plt.title("Yearly wage (log1p) by region")
plt.grid(axis="y", alpha=.25)
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

# Kruskal–Wallis across regions
groups = [g["log_wage"].to_numpy() for _, g in tmp.groupby(region_col, observed=True)]
H, p = st.kruskal(*groups, nan_policy="omit")
k, n = len(groups), int(tmp.shape[0])
eta2 = (H - k + 1) / max(n - k, 1)  # ε² effect size
print(f"Kruskal–Wallis H={H:.2f}, p={p:.3g} | epsilon^2≈{eta2:.3f} (k={k}, n={n})")

# Robust table: prevailing wage by region (median, 95% CI, lift vs overall median)

dfw = tmp[[region_col, "wage_usd_annual"]].dropna().copy()
dfw["wage_usd_annual"] = pd.to_numeric(dfw["wage_usd_annual"], errors="coerce")
dfw = dfw.dropna()

_overall_med = float(dfw["wage_usd_annual"].median())

def _median_ci(x, z=1.96):
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]
    n = x.size
    if n == 0:
        return (np.nan, np.nan, np.nan)
    xs = np.sort(x)
    med = float(np.median(xs))
    k = int(np.floor(0.5*n - z*np.sqrt(n)/2.0))
    m = int(np.ceil (0.5*n + z*np.sqrt(n)/2.0)) - 1
    k = max(k, 0); m = min(m, n-1)
    return med, float(xs[k]), float(xs[m])

rows = []
for r, g in dfw.groupby(region_col, observed=True):
    med, lo, hi = _median_ci(g["wage_usd_annual"].values)
    rows.append((str(r), int(len(g)), med, lo, hi, med - _overall_med))

region_wage_tbl = (
    pd.DataFrame(rows, columns=[region_col, "n", "median_usd", "ci_low_usd", "ci_high_usd", "lift_usd"])
      .sort_values("median_usd", ascending=False)
      .reset_index(drop=True)
      .assign(
          median_usd=lambda d: d["median_usd"].round(2),
          ci_low_usd=lambda d: d["ci_low_usd"].round(2),
          ci_high_usd=lambda d: d["ci_high_usd"].round(2),
          lift_usd=lambda d: d["lift_usd"].round(2),
      )
)

region_wage_tbl

# Sensitivity 1 — Year-only rows (no annualization)
reg = "_region" if "_region" in DF.columns else "region_of_employment"
uow = "_uow" if "_uow" in DF.columns else "unit_of_wage"
year_only = DF[(DF[uow].astype(str).str.title()=="Year") & DF["prevailing_wage"].notna()]
tab_year = (year_only[[reg,"prevailing_wage"]]
            .assign(wage_usd_annual=pd.to_numeric(year_only["prevailing_wage"], errors="coerce"))
            .dropna()
            .groupby(reg)["wage_usd_annual"]
            .agg(n="size", median_usd="median")
            .sort_values("median_usd", ascending=False))
print("\nYear-only rows (no annualization)")
display(tab_year)

# Sensitivity 2 — Winsorize annual wages (0.5% tails) before medians
w = dfw["wage_usd_annual"].values
lo, hi = np.percentile(w[~np.isnan(w)], [0.5, 99.5])
dfw_wins = dfw.assign(w_wins=np.clip(dfw["wage_usd_annual"], lo, hi))
tab_wins = (dfw_wins.groupby(reg, observed=True)["w_wins"]
            .agg(n="size", median_usd="median")
            .sort_values("median_usd", ascending=False))
print("\nWinsorized wages (0.5% tails)")
display(tab_wins)

# Sensitivity 3 — Bootstrap CI and difference: Island vs Midwest medians
rng = np.random.default_rng(72)
def boot_med(x, B=5000):
    x = np.asarray(x, float); x = x[~np.isnan(x)]
    n = x.size; idx = rng.integers(0, n, size=(B, n))
    return np.median(x[idx], axis=1)

m_island  = dfw.loc[dfw[reg]=="Island",  "wage_usd_annual"].values
m_midwest = dfw.loc[dfw[reg]=="Midwest", "wage_usd_annual"].values
bI = boot_med(m_island); bM = boot_med(m_midwest)
diff = bI - bM
ci = np.quantile(diff, [0.025, 0.975])
p = (np.abs(diff).mean() >= 0).item()  # not a formal p, just keep CI
print(f"Island–Midwest median diff (bootstrap): {np.median(m_island)-np.median(m_midwest):.0f} "
      f"CI[{ci[0]:.0f}, {ci[1]:.0f}]")

# Sensitivity 4 — Control for mix: median wage by region within education
edu = "_edu" if "_edu" in DF.columns else "education_of_employee"
dfw2 = DF[[reg, edu, "prevailing_wage", uow]].dropna()
factor = {"Hour":2080,"Week":52,"Month":12,"Year":1}
dfw2["wage_usd_annual"] = (pd.to_numeric(dfw2["prevailing_wage"], errors="coerce") *
                           dfw2[uow].astype(str).str.title().map(factor))
mix_tab = (dfw2.groupby([reg, edu], observed=True)["wage_usd_annual"]
                .median().unstack(edu).round(0))
display(mix_tab)

# Sanity scan — Island distribution, extremes, duplicates
island = dfw.loc[dfw[reg]=="Island", "wage_usd_annual"].sort_values()
print(island.describe(percentiles=[.01,.05,.5,.95,.99]).round(2))
print("Top 5:", island.tail(5).round(0).tolist())
print("Bottom 5:", island.head(5).round(0).tolist())
dup_rate = (island.value_counts().max() / max(1, island.size))
print(f"Duplicate-value concentration: {dup_rate:.3f}")

_overall_med = float(dfw["wage_usd_annual"].median())
print(f"Baseline median (all regions): ${_overall_med:,.0f}")
No description has been provided for this image
Kruskal–Wallis H=268.19, p=7.81e-57 | epsilon^2≈0.010 (k=5, n=25480)

Year-only rows (no annualization)
n median_usd
_region
Island 354 93528.190
Midwest 4116 91357.070
South 6225 76946.330
Northeast 6209 72129.750
West 6058 68485.425
Winsorized wages (0.5% tails)
n median_usd
_region
Island 375 96317.45
Midwest 4307 94520.54
South 7017 84812.35
Northeast 7195 81428.85
West 6586 73867.56
Island–Midwest median diff (bootstrap): 1797 CI[-4287, 7066]
_edu Bachelor's Doctorate High School Master's
_region
Island 99102.0 83226.0 98736.0 96988.0
Midwest 92423.0 80998.0 95779.0 97075.0
Northeast 83410.0 60821.0 84834.0 83254.0
South 85733.0 67438.0 84296.0 87380.0
West 78231.0 53483.0 73431.0 73509.0
count         375.00
mean       170732.10
std        657212.72
min          1985.30
1%           7078.82
5%          24117.07
50%         96317.45
95%        261664.73
99%       1437960.85
max      10074525.76
Name: wage_usd_annual, dtype: float64
Top 5: [1413260.0, 1508265.0, 2420985.0, 7199995.0, 10074526.0]
Bottom 5: [1985.0, 4508.0, 4890.0, 5274.0, 7713.0]
Duplicate-value concentration: 0.003
Baseline median (all regions): $82,839

Yes, combined with the previously observed baseline median of $82,839 the findings show:

  • Island has the highest signal at ~$96,317 with +$13,478

  • Midwest the signal is ~$94,521 with +$11,681

  • South the signal is ~$84,812 with +$1,973

  • Northeast the signal is ~$81,429 with −$1,411

  • West the signal is lowest with ~$73,868 with −$8,972

Confidence shows Island and Midwest are not decisively different the bootstrap CI for Island − Midwest spans zero so treating that gap as suggestive not conclusive

The pattern survives Year-only rows with no conversions, persists within education levels, and uses a median that is robust to wage outliers.

Despite higher wages, Island has the lowest approval rate ~60% making wage and approval separate signals here.

Does visa status vary with changes in the prevailing wage set to protect both local talent and foreign workers?¶

In [39]:
# Prevailing wage → approval (quartiles) visual + robust table

# ensure annualized wage
uow_col = "_uow" if "_uow" in DF.columns else ("unit_of_wage" if "unit_of_wage" in DF.columns else None)
if "_wage_yearly" in DF.columns:
    wage = pd.to_numeric(DF["_wage_yearly"], errors="coerce")
else:
    assert "prevailing_wage" in DF.columns and uow_col is not None, "Need wage + unit columns"
    _factor = {"Hour":2080.0,"Week":52.0,"Month":12.0,"Year":1.0,"Yr":1.0,"Annual":1.0,"Per Year":1.0,"Bi-Week":26.0,"Biweekly":26.0,"Day":260.0}
    wage = pd.to_numeric(DF["prevailing_wage"], errors="coerce") * DF[uow_col].astype(str).str.strip().str.title().map(_factor)

# target
yy = (DF[TARGET_LBL].astype(str).str.strip().str.title()=="Certified").astype("int8")

# quartiles
bands = pd.qcut(wage, 4, labels=["Q1 (lowest)","Q2","Q3","Q4 (highest)"], duplicates="drop")

# group stats
g = (pd.DataFrame({"band": bands, "y": yy})
       .dropna(subset=["band"])
       .groupby("band", observed=True)["y"]
       .agg(n="count", certified="sum", rate="mean")
       .reset_index())

# Wilson CI
def _wilson(k,n,z=1.96):
    if n==0: return (np.nan,np.nan)
    p=k/n; d=1+z*z/n
    c=(p+z*z/(2*n))/d
    h=z*np.sqrt((p*(1-p)+z*z/(4*n))/n)/d
    return (c-h, c+h)

cis = g.apply(lambda r: _wilson(int(r["certified"]), int(r["n"])), axis=1)
g["ci_low"]  = [c[0] for c in cis]
g["ci_high"] = [c[1] for c in cis]

# lift vs baseline
overall = yy.mean()
g["rate_pct"] = (100*g["rate"]).round(1)
g["lift_pp"]  = ((g["rate"] - overall)*100).round(1)

wage_quartile_tbl = g[["band","n","rate","rate_pct","ci_low","ci_high","lift_pp"]]

# visual
plt.figure(figsize=(6.8, 4))
x = np.arange(len(wage_quartile_tbl))
plt.errorbar(
    wage_quartile_tbl["rate"].values,
    x,
    xerr=(wage_quartile_tbl["rate"]-wage_quartile_tbl["ci_low"]).values,
    fmt="o", capsize=4, color="#007ACC", ecolor="#005A99", elinewidth=1.5, markeredgewidth=1.5
)
plt.yticks(x, wage_quartile_tbl["band"].astype(str).tolist())
plt.xlabel("Certification Rate")
plt.title("Visa Approval Rate by Wage Quartile (±95% CI)")
plt.xlim(0, 1)
plt.grid(axis="x", alpha=.25)
sns.despine()
plt.tight_layout()
plt.show()

# table
wage_quartile_tbl
No description has been provided for this image
Out[39]:
band n rate rate_pct ci_low ci_high lift_pp
0 Q1 (lowest) 6370 0.711617 71.2 0.700367 0.722612 4.4
1 Q2 6370 0.696389 69.6 0.684982 0.707560 2.8
2 Q3 6370 0.684772 68.5 0.673254 0.696068 1.7
3 Q4 (highest) 6370 0.578807 57.9 0.566638 0.590881 -8.9

No, combined with the previously observed baseline of 66.8% the findings show:

  • Q1 (lowest) the signal is ~71.2% with +4.4pp

  • Q2 the signal is ~69.6% with +2.8pp

  • Q3 the signal is ~68.5% with +1.7pp

  • Q4 (highest) the signal is ~57.9% with −8.9pp

This tells us wage does not boost certification in the raw split—top-wage roles actually sit below baseline while lower/mid bands are modestly above

Does the unit of prevailing wage (Hourly, Weekly, etc.) have any impact on the likelihood of visa application certification?¶

In [40]:
# Unit-of-wage → approval
uow_col = "_uow" if "_uow" in DF.columns else "unit_of_wage"
yy = (DF[TARGET_LBL].astype(str).str.strip().str.title()=="Certified").astype(int)
overall = yy.mean()

# normalize unit labels
uow = (DF[uow_col].astype("string").str.strip().str.title()
         .replace({"Yr":"Year","Per Year":"Year","Annual":"Year","Bi-Week":"Biweekly"}))

def _wilson(k,n,z=1.96):
    if n==0: return (np.nan,np.nan)
    p=k/n; d=1+z*z/n
    c=(p+z*z/(2*n))/d
    h=z*np.sqrt((p*(1-p)+z*z/(4*n))/n)/d
    return (c-h, c+h)

g = (pd.DataFrame({"unit": uow, "y": yy})
       .dropna(subset=["unit"])
       .groupby("unit", observed=True)["y"]
       .agg(n="count", certified="sum", rate="mean")
       .reset_index())

cis = g.apply(lambda r: _wilson(int(r["certified"]), int(r["n"])), axis=1)
g["ci_low"]  = [c[0] for c in cis]
g["ci_high"] = [c[1] for c in cis]
g["rate_pct"] = (100*g["rate"]).round(1)
g["lift_pp"]  = ((g["rate"] - overall)*100).round(1)

uow_tbl = g[["unit","n","rate","rate_pct","ci_low","ci_high","lift_pp"]] \
            .sort_values("rate", ascending=False)

# Bar + error bars for unit-of-wage approval
_plot = uow_tbl.copy()
x = np.arange(len(_plot))
y = _plot["rate"].values
yerr = ( _plot["rate"] - _plot["ci_low"] ).values  # symmetric enough here

plt.figure(figsize=(6.6, 4))
plt.bar(x, y, edgecolor="white")
plt.errorbar(x, y, yerr=yerr, fmt="none", capsize=3, color="k")
plt.xticks(x, _plot["unit"].astype(str).tolist())
plt.ylim(0, 1)
plt.ylabel("Certification rate")
plt.title("Approval rate by unit of wage (±95% CI)")
plt.grid(axis="y", alpha=.25)
plt.tight_layout()
plt.show()

display(uow_tbl)
No description has been provided for this image
unit n rate rate_pct ci_low ci_high lift_pp
3 Year 22962 0.698850 69.9 0.692884 0.704750 3.1
2 Week 272 0.621324 62.1 0.562366 0.676901 -4.7
1 Month 89 0.617978 61.8 0.514139 0.712052 -5.0
0 Hour 2157 0.346314 34.6 0.326524 0.366651 -32.2

Yes, combined with the observed baseline of 66.8% the findings show:

  • Year the signal is ~69.9% with +3.1pp

  • Week the signal is ~62.1% with −4.7pp

  • Month the signal is ~61.8% with −5.0pp

  • Hour the signal is ~34.6% with −32.2pp

This is a unit/mix effect not causation - going to annualize wage for modeling and treat unit as diagnostic

Data Pre-processing¶

Data Preparation for modeling + Outlier Check¶

In [41]:
# Data Pre-processing → Outlier Check

#  Wage → annual USD from unit_of_wage / _uow
uow_col = "_uow" if "_uow" in DF.columns else "unit_of_wage"

def _unit_to_year_mult(u):
    if pd.isna(u): return np.nan
    s = str(u).strip().lower()
    if s.startswith("hour") or s in {"hr","h","per hour"}: return 2080
    if s.startswith("week") or s in {"wk","w","per week"}: return 52
    if s.startswith("month") or s in {"mo","m","per month"}: return 12
    if s.startswith("year") or s in {"yr","y","per year","annual"}: return 1
    # fallback by first letter
    return {"h":2080,"w":52,"m":12,"y":1}.get(s[:1], np.nan)

mult = DF[uow_col].map(_unit_to_year_mult)
DF["wage_usd_yr"] = DF["prevailing_wage"] * mult
DF["log_wage_usd_yr"] = np.log1p(DF["wage_usd_yr"])

# Employees fix negatives
DF["employees_raw"]  = DF["no_of_employees"]
DF["employees_clip0"] = DF["no_of_employees"].clip(lower=0)
DF["log_employees"]   = np.log1p(DF["employees_clip0"])

# Company age filing-year proxy 2016
FILING_YEAR_PROXY = 2016
DF["company_age"] = (FILING_YEAR_PROXY - DF["yr_of_estab"]).astype("float")
DF["company_age"] = DF["company_age"].where(DF["company_age"].between(0, 200))  # clip implausible tails
DF["log_company_age"] = np.log1p(DF["company_age"])

# IQR outlier summary (1.5×IQR fences
def iqr_outlier_mask(x):
    x = pd.to_numeric(x, errors="coerce")
    q1, q3 = np.nanpercentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5*iqr, q3 + 1.5*iqr
    return (x < lo) | (x > hi)

def _report(block_name, series_map):
    rows = []
    n = len(DF)
    for label, s in series_map.items():
        m = iqr_outlier_mask(s)
        n_out = int(pd.Series(m).sum())
        rows.append([block_name, label, n, n_out, round(100*n_out/n, 2)])
    return pd.DataFrame(rows, columns=["feature","variant","n","n_outliers","pct_outliers"])

rep_wage = _report("wage", {
    "prevailing_wage (raw, mixed units)": DF["prevailing_wage"],
    "wage_usd_yr (normalized)": DF["wage_usd_yr"],
    "log_wage_usd_yr": DF["log_wage_usd_yr"],
})
rep_emp  = _report("employees", {
    "no_of_employees (raw)": DF["employees_raw"],
    "employees_clip0": DF["employees_clip0"],
    "log_employees": DF["log_employees"],
})
rep_age  = _report("company_age", {
    "yr_of_estab (raw)": DF["yr_of_estab"],
    "company_age": DF["company_age"],
    "log_company_age": DF["log_company_age"],
})

OUTLIER_SUMMARY = pd.concat([rep_wage, rep_emp, rep_age], ignore_index=True)

print("=== OUTLIER SUMMARY (IQR, 1.5×IQR) ===")
display(OUTLIER_SUMMARY)

# Quick visuals boxplots: raw → normalized → log1p
fig, axes = plt.subplots(1, 3, figsize=(13, 4))

# Wage
axes[0].boxplot(
    [DF["prevailing_wage"].dropna(),
     DF["wage_usd_yr"].dropna(),
     DF["log_wage_usd_yr"].dropna()],
    labels=["wage (raw)", "wage_usd_yr", "log(wage_usd_yr)"]
)
axes[0].set_title("Wage (raw → annualized → log1p)")

# Employees
axes[1].boxplot(
    [DF["employees_raw"].dropna(),
     DF["employees_clip0"].dropna(),
     DF["log_employees"].dropna()],
    labels=["raw", "clip0", "log1p"]
)
axes[1].set_title("Employees (raw → clip0 → log1p)")

# Company age
axes[2].boxplot(
    [DF["yr_of_estab"].dropna(),
     DF["company_age"].dropna(),
     DF["log_company_age"].dropna()],
    labels=["yr_of_estab", "age", "log1p(age)"]
)
axes[2].set_title("Company age (derivation & transform)")

plt.tight_layout()
plt.show()

# Notes for the next section
print("Notes:")
print("• Used log_wage_usd_yr for modeling; do not hard-drop wage outliers—log transform + possible winsorization is safer.")
print("• Keep employees ≥ 0 (Clipped); prefer log1p(employees_clip0) downstream.")
print("• Company age uses 2016 as a filing-year proxy; replace when a true filing year is available.")
=== OUTLIER SUMMARY (IQR, 1.5×IQR) ===
feature variant n n_outliers pct_outliers
0 wage prevailing_wage (raw, mixed units) 25480 427 1.68
1 wage wage_usd_yr (normalized) 25480 2387 9.37
2 wage log_wage_usd_yr 25480 2837 11.13
3 employees no_of_employees (raw) 25480 1556 6.11
4 employees employees_clip0 25480 1556 6.11
5 employees log_employees 25480 1915 7.52
6 company_age yr_of_estab (raw) 25480 3260 12.79
7 company_age company_age 25480 3196 12.54
8 company_age log_company_age 25480 23 0.09
No description has been provided for this image
Notes:
• Used log_wage_usd_yr for modeling; do not hard-drop wage outliers—log transform + possible winsorization is safer.
• Keep employees ≥ 0 (Clipped); prefer log1p(employees_clip0) downstream.
• Company age uses 2016 as a filing-year proxy; replace when a true filing year is available.
In [42]:
# Feature prep policy

# Config 
WINSORIZE = True            
TOP_P    = 0.01            
BOTTOM_P = 0.00             

# Ensure they exist:
req_cols = [
    "wage_usd_yr","log_wage_usd_yr",
    "employees_clip0","log_employees",
    "company_age","log_company_age"
]
missing_req = [c for c in req_cols if c not in DF.columns]
if missing_req:
    raise KeyError(f"Missing expected columns from prior step: {missing_req}")

# Winsorization helper
def winsorize(s: pd.Series, lo=0.0, hi=0.01):
    x = pd.to_numeric(s, errors="coerce")
    lo_b = np.nanquantile(x, lo) if lo > 0 else None
    hi_b = np.nanquantile(x, 1-hi) if hi > 0 else None
    if lo_b is not None:
        x = np.maximum(x, lo_b)
    if hi_b is not None:
        x = np.minimum(x, hi_b)
    return pd.Series(x, index=s.index)

# Create winsorized copies 
if WINSORIZE:
    DF["wage_usd_yr_cap"]     = winsorize(DF["wage_usd_yr"], lo=BOTTOM_P, hi=TOP_P)
    DF["employees_clip0_cap"] = winsorize(DF["employees_clip0"], lo=0.0, hi=TOP_P)
else:
    DF["wage_usd_yr_cap"]     = DF["wage_usd_yr"]
    DF["employees_clip0_cap"] = DF["employees_clip0"]

# Imputation global median
for col in ["wage_usd_yr_cap", "employees_clip0_cap", "company_age"]:
    if DF[col].isna().any():
        med = DF[col].median()
        DF[col] = DF[col].fillna(med)

# Recompute logs if any imputation changed bases
DF["log_wage_usd_yr"] = np.log1p(DF["wage_usd_yr_cap"])
DF["log_employees"]   = np.log1p(DF["employees_clip0_cap"])
DF["log_company_age"] = np.log1p(DF["company_age"])

# Final feature inventories numeric only
NUMERIC_AVAILABLE = [
    # wage
    "wage_usd_yr_cap", "log_wage_usd_yr",
    # employees
    "employees_clip0_cap", "log_employees",
    # company age
    "company_age", "log_company_age",
]

# Preferred sets by model family
FEATURES_FOR_LINEAR = [
    # prefer logs for linear/logistic models
    "log_wage_usd_yr", "log_employees", "log_company_age",
]

FEATURES_FOR_TREES = [
    # keep both raw-ish and log as helpers
    "wage_usd_yr_cap", "log_wage_usd_yr",
    "employees_clip0_cap", "log_employees",
    "company_age", "log_company_age",
]

# Sanity: no NaNs in chosen features
def _nan_report(cols):
    rpt = DF[cols].isna().sum().sort_values(ascending=False)
    has_nans = int(rpt.sum()) > 0
    return has_nans, rpt

has_nans_lin, rpt_lin = _nan_report(FEATURES_FOR_LINEAR)
has_nans_tree, rpt_tree = _nan_report(FEATURES_FOR_TREES)

print("=== Feature Policy Summary ===")
print("Winsorization:", WINSORIZE, f"(top={TOP_P:.3f}, bottom={BOTTOM_P:.3f})")
print("\nNumeric features available:", NUMERIC_AVAILABLE)
print("\nLinear/Logistic preferred features:", FEATURES_FOR_LINEAR)
print("Tree/XGB features:", FEATURES_FOR_TREES)

if has_nans_lin or has_nans_tree:
    print("\n[WARN] NaNs detected in selected features.")
    print("\nNaN counts (Linear/Logistic):")
    display(rpt_lin[rpt_lin>0])
    print("\nNaN counts (Tree/XGB):")
    display(rpt_tree[rpt_tree>0])
    # Final safety imputer 
    DF[FEATURES_FOR_LINEAR] = DF[FEATURES_FOR_LINEAR].fillna(DF[FEATURES_FOR_LINEAR].median())
    DF[FEATURES_FOR_TREES]  = DF[FEATURES_FOR_TREES].fillna(DF[FEATURES_FOR_TREES].median())
else:
    print("\n[OK] No NaNs in selected feature sets.")

# Documentation note
print("\nNote: company_age uses filing-year proxy 2016 (per dataset convention). Update if a true filing year is added.")
=== Feature Policy Summary ===
Winsorization: True (top=0.010, bottom=0.000)

Numeric features available: ['wage_usd_yr_cap', 'log_wage_usd_yr', 'employees_clip0_cap', 'log_employees', 'company_age', 'log_company_age']

Linear/Logistic preferred features: ['log_wage_usd_yr', 'log_employees', 'log_company_age']
Tree/XGB features: ['wage_usd_yr_cap', 'log_wage_usd_yr', 'employees_clip0_cap', 'log_employees', 'company_age', 'log_company_age']

[OK] No NaNs in selected feature sets.

Note: company_age uses filing-year proxy 2016 (per dataset convention). Update if a true filing year is added.
In [43]:
# FULL PREFLIGHT AUDIT 

#Respect project seed
try:
    RNG
except NameError:
    RNG = 72
np.random.seed(RNG)

print("=== EASYVISA: END-TO-END PREPROCESSING AUDIT ===")
print(f"[INFO] RNG seed: {RNG}")

# Basic presence
RAW_REQ = {
    "case_id","continent","education_of_employee","has_job_experience",
    "requires_job_training","no_of_employees","yr_of_estab",
    "region_of_employment","prevailing_wage","unit_of_wage",
    "full_time_position","case_status"
}
CANON_REQ = {
    "_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time",
    "_case_label","y"
}
ENG_NUM_REQ = {
    "wage_usd_yr", "wage_usd_yr_cap","log_wage_usd_yr",
    "employees_clip0","employees_clip0_cap","log_employees",
    "company_age","log_company_age"
}
CAT_ALL = ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"]
CAT_COLS = [c for c in CAT_ALL if c in DF.columns]

def _need(cols, name):
    miss = sorted([c for c in cols if c not in DF.columns])
    status = "OK" if not miss else f"MISSING → {miss}"
    print(f"[CHECK] {name}: {status}")
    if miss: raise AssertionError(f"{name} missing: {miss}")

_need(RAW_REQ,   "Raw columns")
_need(CANON_REQ, "Canonical columns")
_need(ENG_NUM_REQ, "Engineered numeric columns")

print(f"[INFO] Shape: {len(DF):,} rows × {DF.shape[1]} cols\n")

# Target + class mix
vc_lbl = DF["_case_label"].value_counts()
vc_y   = DF["y"].value_counts()
pos_rate = DF["y"].mean()
print("— Target balance —")
print(vc_lbl.to_string())
print(vc_y.to_string())
print(f"[OK] Positive rate (P(certified)) = {pos_rate:.4f}\n")

# EDA invariants verify-only
def _norm(s): return s.astype("string").str.strip().str.lower()

# a) Education monotonic: HS < Bach < Mast < Doc
edu_order = ["high school","bachelor's","master's","doctorate"]
edu_rates = (DF.assign(_edu=_norm(DF["_edu"]))
               .groupby("_edu")["y"].mean()
               .reindex(edu_order))
if edu_rates.isna().any():
    print("[WARN] Education levels missing for monotonic check:", edu_rates[edu_rates.isna()].index.tolist())
else:
    mono = np.all(np.diff(edu_rates.values) > 0)
    print("— EDA invariant: Education monotonic (HS<Bach<Mast<Doc) —")
    print(edu_rates.round(4).to_string())
    print("[OK] Monotonic ↑ confirmed.\n" if mono else "[FAIL] Not monotonic!\n")

# b) Job experience: Yes > No
exp_rates = (DF.assign(_job_exp=_norm(DF["_job_exp"]))
               .groupby("_job_exp")["y"].mean())
if {"y","n"}.issubset(exp_rates.index):
    print("— EDA invariant: Job experience (Y > N) —")
    print(exp_rates.round(4).to_string())
    if not (exp_rates["y"] > exp_rates["n"]):
        raise AssertionError("Job experience invariant failed.")
    print("[OK] Experience improves approval.\n")
else:
    print("[WARN] _job_exp not normalized to {'y','n'}; skipping.\n")

# c) Unit-of-wage broad pattern: Year > Hour
uow_rates = (DF.assign(_uow=_norm(DF["_uow"]))
               .groupby("_uow")["y"].mean())
if {"year","hour"}.issubset(uow_rates.index):
    print("— EDA invariant: Unit-of-wage (Year > Hour) —")
    print(uow_rates.round(4).to_string())
    if not (uow_rates["year"] > uow_rates["hour"]):
        raise AssertionError("UoW invariant failed.")
    print("[OK] Broad pattern holds.\n")
else:
    print("[WARN] Missing 'year' or 'hour' in _uow; skipping.\n")

# Engineering policy audit
print("— Engineering policy —")
print("Annualization present: wage_usd_yr")
print("Log features present:  log_wage_usd_yr, log_employees, log_company_age")
print("Clip/Winsor present:   employees_clip0(+_cap), wage_usd_yr_cap")
print("Age proxy present:     company_age (2016 proxy)\n")

# 3a) Annualization mapping sanity + impute count estimate from raw
def _map_mult(u):
    if pd.isna(u): return np.nan
    s = str(u).strip().lower()
    if s.startswith("hour") or s in {"hr","h"}: return 2080.0
    if s.startswith("week") or s in {"wk","w"}: return 52.0
    if s.startswith("month") or s in {"mo","m"}: return 12.0
    if s.startswith("year") or s in {"yr","y","annual"}: return 1.0
    if s.startswith("bi"): return 26.0
    if s.startswith("day"): return 260.0
    return np.nan

annual_raw = pd.to_numeric(DF["prevailing_wage"], errors="coerce") * DF["unit_of_wage"].map(_map_mult)
annual_na  = int(annual_raw.isna().sum())
cap_na     = int(DF["wage_usd_yr_cap"].isna().sum())
print(f"[INFO] Annualization reconstruct: NaNs before policy ≈ {annual_na:,}")
print(f"[INFO] After policy (cap+impute): wage_usd_yr_cap NaNs = {cap_na}\n")
if cap_na != 0:
    raise AssertionError("wage_usd_yr_cap still has NaNs.")

# 3b) Negatives clipped in employees
neg_emp = int(pd.to_numeric(DF["no_of_employees"], errors="coerce").lt(0).sum())
print(f"[INFO] Negative employees identified & clipped: {neg_emp:,}\n")

# 3c) Winsorization counts top caps
def _winsor_counts(base, capped):
    base = pd.to_numeric(base, errors="coerce")
    capped = pd.to_numeric(capped, errors="coerce")
    mask_hi = capped < base  # only top-capping used in our policy
    return int(mask_hi.sum())

w_cap = _winsor_counts(DF["wage_usd_yr"], DF["wage_usd_yr_cap"])
e_cap = _winsor_counts(DF["employees_clip0"], DF["employees_clip0_cap"])
print("— Winsorization (top 1%) impact —")
print(f"wage_usd_yr:     capped rows = {w_cap:,}")
print(f"employees_clip0: capped rows = {e_cap:,}\n")

# 3d) Imputation counts actually applied
def _impute_count(col_cap, base_na_series=None):
    if base_na_series is not None:
        return int(base_na_series.sum())
    # fallback (if not reconstructible): report 0 if col has no NaNs now
    return 0

emp_base_na = pd.to_numeric(DF["no_of_employees"], errors="coerce").isna()
age_base_na = ~pd.to_numeric(DF["yr_of_estab"], errors="coerce").between(0, 9999)  # coarse
imp_wage = _impute_count("wage_usd_yr_cap", annual_raw.isna())
imp_emp  = _impute_count("employees_clip0_cap", emp_base_na)
imp_age  = _impute_count("company_age", age_base_na)
print("— Median imputations —")
print(f"wage_usd_yr_cap: ~{imp_wage:,} rows filled")
print(f"employees_clip0_cap: ~{imp_emp:,} rows filled")
print(f"company_age: ~{imp_age:,} rows filled\n")

# 3e) Numeric diagnostics post-policy
def _num_diag(col):
    s = pd.to_numeric(DF[col], errors="coerce")
    q = s.quantile([.01,.25,.5,.75,.99]).round(3)
    return {
        "min": float(s.min()),
        "p01": float(q.loc[.01]),
        "p25": float(q.loc[.25]),
        "p50": float(q.loc[.5]),
        "p75": float(q.loc[.75]),
        "p99": float(q.loc[.99]),
        "max": float(s.max())
    }

NUM_CHECKS = ["wage_usd_yr","wage_usd_yr_cap","log_wage_usd_yr",
              "employees_clip0","employees_clip0_cap","log_employees",
              "company_age","log_company_age"]

print("— Numeric feature ranges (post-policy) —")
for c in NUM_CHECKS:
    d = _num_diag(c)
    print(f"{c:>20}: min={d['min']:.3g}  p01={d['p01']:.3g}  p50={d['p50']:.3g}  p99={d['p99']:.3g}  max={d['max']:.3g}")
print()

# Categorical cleaning + rare bucketing 
print("— Categoricals (cleaned & rare-bucketed) —")
for c in CAT_COLS:
    s = DF[c].astype("string")
    nlev = int(s.nunique())
    n_unknown = int((s == "unknown").sum())
    n_other = int((s == f"{c}_other").sum())
    top = s.value_counts().head(6).to_dict()
    print(f"{c:>12}: levels={nlev}  'unknown'={n_unknown:,}  '{c}_other'={n_other:,}  top6={top}")
print()

# 5) Dedupe
if "case_id" in DF.columns:
    dups = int(DF["case_id"].duplicated().sum())
    print(f"— Dedupe —\nDuplicate case_id rows present: {dups}")
    if dups != 0:
        raise AssertionError("Duplicates remain on case_id.")
    print("[OK] No duplicate IDs.\n")

# Feature sets + NaN guards
try:
    FEATURES_FOR_LINEAR
    FEATURES_FOR_TREES
except NameError:
    # Fall back to the policy we defined in prep
    FEATURES_FOR_LINEAR = ["log_wage_usd_yr","log_employees","log_company_age"] + CAT_COLS
    FEATURES_FOR_TREES  = ["wage_usd_yr_cap","log_wage_usd_yr",
                           "employees_clip0_cap","log_employees",
                           "company_age","log_company_age"] + CAT_COLS

print("— Feature sets —")
print("Linear/Logistic:", FEATURES_FOR_LINEAR)
print("Tree/XGB:      ", FEATURES_FOR_TREES)

def _nan_sum(cols): return int(DF[cols].isna().sum().sum())
nan_lin  = _nan_sum(FEATURES_FOR_LINEAR)
nan_tree = _nan_sum(FEATURES_FOR_TREES)
print(f"[CHECK] NaNs in Linear set: {nan_lin} | NaNs in Tree set: {nan_tree}")
if nan_lin or nan_tree:
    raise AssertionError("NaNs remain in selected features.")
print("[OK] No NaNs in selected feature sets.\n")

# 7) Final readiness
print("=== PREPROCESSING AUDIT: PASS ✓  — Ready for modeling ===")
=== EASYVISA: END-TO-END PREPROCESSING AUDIT ===
[INFO] RNG seed: 72
[CHECK] Raw columns: OK
[CHECK] Canonical columns: OK
[CHECK] Engineered numeric columns: OK
[INFO] Shape: 25,480 rows × 30 cols

— Target balance —
_case_label
Certified    17018
Denied        8462
y
1    17018
0     8462
[OK] Positive rate (P(certified)) = 0.6679

— EDA invariant: Education monotonic (HS<Bach<Mast<Doc) —
_edu
high school    0.3404
bachelor's     0.6221
master's       0.7863
doctorate      0.8723
[OK] Monotonic ↑ confirmed.

— EDA invariant: Job experience (Y > N) —
_job_exp
n    0.5613
y    0.7448
[OK] Experience improves approval.

— EDA invariant: Unit-of-wage (Year > Hour) —
_uow
hour     0.3463
month     0.618
week     0.6213
year     0.6989
[OK] Broad pattern holds.

— Engineering policy —
Annualization present: wage_usd_yr
Log features present:  log_wage_usd_yr, log_employees, log_company_age
Clip/Winsor present:   employees_clip0(+_cap), wage_usd_yr_cap
Age proxy present:     company_age (2016 proxy)

[INFO] Annualization reconstruct: NaNs before policy ≈ 0
[INFO] After policy (cap+impute): wage_usd_yr_cap NaNs = 0

[INFO] Negative employees identified & clipped: 33

— Winsorization (top 1%) impact —
wage_usd_yr:     capped rows = 255
employees_clip0: capped rows = 255

— Median imputations —
wage_usd_yr_cap: ~0 rows filled
employees_clip0_cap: ~0 rows filled
company_age: ~0 rows filled

— Numeric feature ranges (post-policy) —
         wage_usd_yr: min=100  p01=2.63e+03  p50=8.28e+04  p99=2.04e+06  max=1.46e+07
     wage_usd_yr_cap: min=100  p01=2.63e+03  p50=8.28e+04  p99=2.04e+06  max=2.04e+06
     log_wage_usd_yr: min=4.62  p01=7.88  p50=11.3  p99=14.5  max=14.5
     employees_clip0: min=0  p01=52  p50=2.11e+03  p99=1.03e+05  max=6.02e+05
 employees_clip0_cap: min=0  p01=52  p50=2.11e+03  p99=1.03e+05  max=1.03e+05
       log_employees: min=0  p01=3.97  p50=7.65  p99=11.5  max=11.5
         company_age: min=0  p01=2  p50=19  p99=178  max=199
     log_company_age: min=0  p01=1.1  p50=3  p99=5.19  max=5.3

— Categoricals (cleaned & rare-bucketed) —
        _edu: levels=4  'unknown'=0  '_edu_other'=0  top6={"Bachelor's": 10234, "Master's": 9634, 'High School': 3420, 'Doctorate': 2192}
  _continent: levels=6  'unknown'=0  '_continent_other'=0  top6={'Asia': 16861, 'Europe': 3732, 'North America': 3292, 'South America': 852, 'Africa': 551, 'Oceania': 192}
     _region: levels=5  'unknown'=0  '_region_other'=0  top6={'Northeast': 7195, 'South': 7017, 'West': 6586, 'Midwest': 4307, 'Island': 375}
        _uow: levels=4  'unknown'=0  '_uow_other'=0  top6={'Year': 22962, 'Hour': 2157, 'Week': 272, 'Month': 89}
    _job_exp: levels=2  'unknown'=0  '_job_exp_other'=0  top6={'Y': 14802, 'N': 10678}
  _job_train: levels=2  'unknown'=0  '_job_train_other'=0  top6={'N': 22525, 'Y': 2955}
  _full_time: levels=2  'unknown'=0  '_full_time_other'=0  top6={'Y': 22773, 'N': 2707}

— Dedupe —
Duplicate case_id rows present: 0
[OK] No duplicate IDs.

— Feature sets —
Linear/Logistic: ['log_wage_usd_yr', 'log_employees', 'log_company_age']
Tree/XGB:       ['wage_usd_yr_cap', 'log_wage_usd_yr', 'employees_clip0_cap', 'log_employees', 'company_age', 'log_company_age']
[CHECK] NaNs in Linear set: 0 | NaNs in Tree set: 0
[OK] No NaNs in selected feature sets.

=== PREPROCESSING AUDIT: PASS ✓  — Ready for modeling ===

Model Building¶

Model Evaluation Criterion¶

  • Choose the primary metric to evaluate the model on
  • Elaborate on the rationale behind choosing the metric

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.

  • The model_performance_classification_sklearn function will be used to check the model performance of models.
  • The confusion_matrix_sklearn function will be used to plot the confusion matrix.
In [44]:
# Evaluation utilities for classification threshold-free + thresholded
from sklearn.metrics import (
    roc_auc_score, average_precision_score, roc_curve, precision_recall_curve,
    confusion_matrix, accuracy_score, precision_score, recall_score, f1_score,
    brier_score_loss, log_loss
)

def choose_threshold(y_true, y_prob, *, method="youden", target=None):
    """
    Optional helper to pick an operating threshold on validation/OOF probabilities.
    method: 'youden' | 'f1' | 'precision_at' | 'recall_at'
      - 'precision_at' or 'recall_at' require target in (0,1].
    Returns (threshold, summary_dict).
    """
    y_true = np.asarray(y_true).astype(int)
    y_prob = np.asarray(y_prob).astype(float)

    if method == "youden":
        fpr, tpr, thr = roc_curve(y_true, y_prob)
        j = tpr - fpr
        t  = float(thr[int(np.argmax(j))])
    elif method == "f1":
        grid = np.quantile(y_prob, np.linspace(0.01, 0.99, 199))
        best, t = -1, 0.5
        for g in grid:
            pred = (y_prob >= g).astype(int)
            f1 = f1_score(y_true, pred, zero_division=0)
            if f1 > best:
                best, t = f1, float(g)
    elif method == "precision_at":
        assert target is not None and 0 < target <= 1
        prec, rec, thr = precision_recall_curve(y_true, y_prob)
        thr = np.r_[thr, 1.0]
        mask = prec >= target
        t = float(thr[mask][-1]) if mask.any() else 1.0
    elif method == "recall_at":
        assert target is not None and 0 < target <= 1
        prec, rec, thr = precision_recall_curve(y_true, y_prob)
        thr = np.r_[thr, 1.0]
        mask = rec >= target
        t = float(thr[mask][-1]) if mask.any() else 1.0
    else:
        raise ValueError("Unsupported method.")

    pred = (y_prob >= t).astype(int)
    return t, dict(
        threshold=t,
        precision=precision_score(y_true, pred, zero_division=0),
        recall=recall_score(y_true, pred, zero_division=0),
        f1=f1_score(y_true, pred, zero_division=0),
        accuracy=accuracy_score(y_true, pred),
    )

def model_performance_classification_sklearn(y_true, y_prob, *, threshold=0.5):
    """
    Computes threshold-free metrics (ROC-AUC, PR-AUC, KS, Brier, LogLoss) and
    thresholded metrics (Accuracy, Precision, Recall, F1) at `threshold`.
    Returns (metrics_dict, y_pred_binary).
    """
    y_true = np.asarray(y_true).astype(int)
    y_prob = np.asarray(y_prob).astype(float)
    y_pred = (y_prob >= threshold).astype(int)

    # Threshold-free
    rocAUC = roc_auc_score(y_true, y_prob)
    prAUC  = average_precision_score(y_true, y_prob)  # PR-AUC/AP for positive class
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    ks = float(np.max(tpr - fpr))
    brier = brier_score_loss(y_true, y_prob)
    ll = log_loss(y_true, np.c_[1 - y_prob, y_prob], labels=[0, 1])

    # Thresholded
    out = dict(
        roc_auc=rocAUC, pr_auc=prAUC, ks=ks, brier=brier, log_loss=ll,
        accuracy=accuracy_score(y_true, y_pred),
        precision=precision_score(y_true, y_pred, zero_division=0),
        recall=recall_score(y_true, y_pred, zero_division=0),
        f1=f1_score(y_true, y_pred, zero_division=0),
        threshold=float(threshold),
    )
    return out, y_pred

def confusion_matrix_sklearn(y_true, y_prob, *, threshold, labels=("Denied","Certified")):
    """
    Plots a 2x2 confusion matrix at the chosen probability threshold.
    Returns (cm_array, counts_dict).
    """
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob) >= float(threshold)).astype(int)
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()

    fig, ax = plt.subplots(figsize=(4.6, 4))
    ax.imshow(cm, interpolation="nearest")
    ax.set_xticks([0, 1]); ax.set_yticks([0, 1])
    ax.set_xticklabels(labels); ax.set_yticklabels(labels)
    ax.set_xlabel("Predicted"); ax.set_ylabel("Actual")
    ax.set_title(f"Confusion matrix @ threshold={threshold:.3f}")
    for i in range(2):
        for j in range(2):
            ax.text(j, i, f"{cm[i, j]:,}", ha="center", va="center")
    plt.tight_layout(); plt.show()

    return cm, dict(tn=int(tn), fp=int(fp), fn=int(fn), tp=int(tp))
In [45]:
# Build pipelines if missing + OOF evaluation

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold

# Guards use existing objects
assert "RNG" in globals(), "RNG must be defined in your config cell."
assert "DF" in globals(), "DF not found."
assert "FEATURES_FOR_LINEAR" in globals() and "FEATURES_FOR_TREES" in globals(), "Feature lists missing."

CAT_COLS = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF.columns]
NUM_LINEAR = [c for c in FEATURES_FOR_LINEAR if c in DF.columns and c not in CAT_COLS]
NUM_TREES  = [c for c in FEATURES_FOR_TREES  if c in DF.columns and c not in CAT_COLS]
TARGET = "y"
y = DF[TARGET].astype(int).to_numpy()

# Define metrics helpers
if "choose_threshold" not in globals():
    def choose_threshold(y_true, y_prob, *, method="youden", target=None):
        y_true = np.asarray(y_true).astype(int)
        y_prob = np.asarray(y_prob).astype(float)
        fpr, tpr, thr_roc = roc_curve(y_true, y_prob)
        if method == "youden":
            j = tpr - fpr
            thr = float(thr_roc[int(np.argmax(j))])
        elif method == "f1":
            grid = np.quantile(y_prob, np.linspace(0.01, 0.99, 199))
            best, thr = -1, 0.5
            for t in grid:
                yp = (y_prob >= t).astype(int)
                f1 = f1_score(y_true, yp, zero_division=0)
                if f1 > best: best, thr = f1, float(t)
        elif method == "precision_at":
            assert target is not None and 0 < target <= 1
            prec, rec, thr_pr = precision_recall_curve(y_true, y_prob)
            thr_pr = np.r_[thr_pr, 1.0]
            mask = prec >= target
            thr = float(thr_pr[mask][-1]) if mask.any() else 1.0
        elif method == "recall_at":
            assert target is not None and 0 < target <= 1
            prec, rec, thr_pr = precision_recall_curve(y_true, y_prob)
            thr_pr = np.r_[thr_pr, 1.0]
            mask = rec >= target
            thr = float(thr_pr[mask][-1]) if mask.any() else 1.0
        else:
            raise ValueError("Unsupported method.")
        yp = (y_prob >= thr).astype(int)
        return thr, dict(
            threshold=thr,
            precision=precision_score(y_true, yp, zero_division=0),
            recall=recall_score(y_true, yp, zero_division=0),
            f1=f1_score(y_true, yp, zero_division=0),
            accuracy=accuracy_score(y_true, yp),
        )

if "model_performance_classification_sklearn" not in globals():
    def model_performance_classification_sklearn(y_true, y_prob, *, threshold=0.5):
        y_true = np.asarray(y_true).astype(int)
        y_prob = np.asarray(y_prob).astype(float)
        y_pred = (y_prob >= threshold).astype(int)
        rocAUC = roc_auc_score(y_true, y_prob)
        prAUC  = average_precision_score(y_true, y_prob)
        fpr, tpr, _ = roc_curve(y_true, y_prob)
        ks = float(np.max(tpr - fpr))
        brier = brier_score_loss(y_true, y_prob)
        ll = log_loss(y_true, np.c_[1 - y_prob, y_prob], labels=[0,1])
        return dict(
            roc_auc=rocAUC, pr_auc=prAUC, ks=ks, brier=brier, log_loss=ll,
            accuracy=accuracy_score(y_true, y_pred),
            precision=precision_score(y_true, y_pred, zero_division=0),
            recall=recall_score(y_true, y_pred, zero_division=0),
            f1=f1_score(y_true, y_pred, zero_division=0),
            threshold=threshold
        ), y_pred

def confusion_matrix_sklearn(y_true, y_prob, *, threshold, labels=("Denied","Certified")):
    y_pred = (np.asarray(y_prob) >= float(threshold)).astype(int)
    cm = confusion_matrix(np.asarray(y_true).astype(int), y_pred, labels=[0,1])
    fig, ax = plt.subplots(figsize=(4.6,4))
    ax.imshow(cm, interpolation="nearest", cmap="Pastel1_r")
    ax.set_xticks([0,1]); ax.set_yticks([0,1])
    ax.set_xticklabels(labels); ax.set_yticklabels(labels)
    ax.set_xlabel("Predicted"); ax.set_ylabel("Actual")
    ax.set_title(f"Confusion matrix @ threshold={threshold:.3f}")
    for i in range(2):
        for j in range(2):
            ax.text(j, i, f"{cm[i,j]:,}", ha="center", va="center")
    plt.tight_layout(); plt.show()
    return cm

# Build preprocessors/pipelines ONLY if missing
if "pipe_logit" not in globals():
    pp_linear = ColumnTransformer(
        [("num", StandardScaler(), NUM_LINEAR),
         ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), CAT_COLS)],
        remainder="drop", verbose_feature_names_out=False
    )
    pipe_logit = Pipeline([("prep", pp_linear),
                           ("model", LogisticRegression(max_iter=2000, solver="lbfgs", class_weight="balanced"))])

if "pipe_forest" not in globals():
    pp_trees = ColumnTransformer(
        [("num", "passthrough", NUM_TREES),
         ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), CAT_COLS)],
        remainder="drop", verbose_feature_names_out=False
    )
    pipe_forest = Pipeline([("prep", pp_trees),
                            ("model", RandomForestClassifier(n_estimators=600, n_jobs=-1,
                                                            random_state=RNG,
                                                            class_weight="balanced_subsample"))])

MODELS = {
    "Logit (logs+OHE+cw)": pipe_logit,
    "RF (raw+log+OHE+cw)":  pipe_forest,
}

# OOF evaluation ROC-AUC primary, PR-AUC secondary
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=RNG)

def oof_eval(name, pipe):
    oof = np.zeros_like(y, dtype=float)
    for tr, te in skf.split(DF, y):
        pipe.fit(DF.iloc[tr], y[tr])
        oof[te] = pipe.predict_proba(DF.iloc[te])[:, 1]
    thr, thr_summ = choose_threshold(y, oof, method="youden")
    perf, _ = model_performance_classification_sklearn(y, oof, threshold=thr)
    row = dict(model=name, **perf, **{f"thr_{k}":v for k,v in thr_summ.items()})
    return row, oof, thr

rows, oofs, thrs = [], {}, {}
for name, pipe in MODELS.items():
    row, oof, thr = oof_eval(name, pipe)
    rows.append(row); oofs[name] = oof; thrs[name] = thr

eval_tbl = (pd.DataFrame(rows)
            .loc[:, ["model","roc_auc","pr_auc","ks","brier","log_loss",
                     "accuracy","precision","recall","f1","threshold"]]
            .sort_values(["roc_auc","pr_auc"], ascending=False)
            .reset_index(drop=True)
            .round(4))

print("=== Model Selection (OOF) — ROC-AUC primary, PR-AUC secondary ===")
display(eval_tbl)

# Confusion matrices
for name in MODELS.keys():
    print(f"\n--- {name} — confusion matrix @ thr={thrs[name]:.3f} ---")
    confusion_matrix_sklearn(y, oofs[name], threshold=thrs[name], labels=("Denied","Certified"))
=== Model Selection (OOF) — ROC-AUC primary, PR-AUC secondary ===
model roc_auc pr_auc ks brier log_loss accuracy precision recall f1 threshold
0 Logit (logs+OHE+cw) 0.7701 0.8649 0.3978 0.1953 0.5785 0.6901 0.8311 0.6728 0.7436 0.5234
1 RF (raw+log+OHE+cw) 0.7607 0.8628 0.3842 0.1820 0.5404 0.6969 0.8151 0.7064 0.7569 0.6633
--- Logit (logs+OHE+cw) — confusion matrix @ thr=0.523 ---
No description has been provided for this image
--- RF (raw+log+OHE+cw) — confusion matrix @ thr=0.663 ---
No description has been provided for this image
In [46]:
# Visualize BASELINES (Original): Logistic + RandomForest
# - Logistic: top coefficients (±) + odds ratios
# - RandomForest: one tree (depth=3 view) + rules + importances

from sklearn.tree import plot_tree, export_text

# 0) Pull fitted pipelines & feature names
def _extract_estimator_from_calibrator(cal):
    """Return the fitted Pipeline(prep+model) from a CalibratedClassifierCV across sklearn versions."""
    # Preferred: inside calibrated_classifiers_[0]
    if hasattr(cal, "calibrated_classifiers_") and cal.calibrated_classifiers_:
        cc = cal.calibrated_classifiers_[0]
        for attr in ("estimator", "estimator_", "base_estimator", "classifier", "clf"):
            est = getattr(cc, attr, None)
            if isinstance(est, Pipeline):
                return est
    # Fallbacks sometimes exposed on the CalibratedClassifierCV itself
    for attr in ("base_estimator_", "base_estimator", "estimator"):
        est = getattr(cal, attr, None)
        if isinstance(est, Pipeline):
            return est
    return None  # let caller decide how to rebuild

# Try to get the fitted pipelines from the calibrators
logit_pipe = pipe_logit
rf_pipe    = pipe_forest

# If not found (older/newer sklearn), rebuild minimal equivalents for VIS ONLY
if logit_pipe is None or rf_pipe is None:
    # Rebuild the same preprocessors used for baselines
    CAT_COLS   = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF_TR.columns]
    NUM_LINEAR = [c for c in ["log_wage_usd_yr","log_employees","log_company_age"]                      if c in DF_TR.columns]
    NUM_TREES  = [c for c in ["wage_usd_yr_cap","log_wage_usd_yr","employees_clip0_cap",
                              "log_employees","company_age","log_company_age"]                         if c in DF_TR.columns]

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    pp_linear = ColumnTransformer(
        [("num", StandardScaler(), NUM_LINEAR),
         ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), CAT_COLS)],
        remainder="drop", verbose_feature_names_out=False
    )
    pp_trees = ColumnTransformer(
        [("num", "passthrough", NUM_TREES),
         ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), CAT_COLS)],
        remainder="drop", verbose_feature_names_out=False
    )

    if logit_pipe is None:
        logit_pipe = Pipeline([("prep", pp_linear),
                               ("model", LogisticRegression(max_iter=2000, solver="lbfgs",
                                                            class_weight="balanced"))]).fit(DF_TR, y_tr)
    if rf_pipe is None:
        rf_pipe = Pipeline([("prep", pp_trees),
                            ("model", RandomForestClassifier(n_estimators=600, random_state=RNG,
                                                            n_jobs=-1, class_weight="balanced_subsample"))]).fit(DF_TR, y_tr)

# Now pull transformers/models/feature names
logit_prep  = logit_pipe.named_steps["prep"]
logit_model = logit_pipe.named_steps["model"]
logit_feats = logit_prep.get_feature_names_out()

rf_prep  = rf_pipe.named_steps["prep"]
rf_model = rf_pipe.named_steps["model"]
rf_feats = rf_prep.get_feature_names_out()

# 1) Logistic: coefficients & odds ratios 
coef = logit_model.coef_.ravel()
dfc = (pd.DataFrame({"feature": logit_feats, "coef": coef})
         .assign(abs_coef=lambda d: d["coef"].abs(),
                 odds_ratio=lambda d: np.exp(d["coef"]))
         .sort_values("abs_coef", ascending=False))

topk = 15
top_pos = dfc.sort_values("coef", ascending=False).head(topk)
top_neg = dfc.sort_values("coef", ascending=True).head(topk)
viz = pd.concat([top_neg, top_pos])

plt.figure(figsize=(8, 8))
colors = np.where(viz["coef"]>=0, "#2B83F6", "#FF8C42")
plt.barh(viz["feature"], viz["coef"], color=colors)
plt.gca().invert_yaxis()
plt.title("Logistic Regression — Top ± coefficients")
plt.xlabel("Coefficient (standardized)")
plt.tight_layout()
plt.show()

print("=== Logistic — Top features with odds ratios ===")
display(viz.loc[:, ["feature","coef","odds_ratio"]].round(4))

# 2) RandomForest: visualize one tree (depth=3) + rules 
tree_idx = 0   
tree0 = rf_model.estimators_[tree_idx]

plt.figure(figsize=(18, 10))
plot_tree(
    tree0,
    feature_names=rf_feats,
    class_names=["Denied","Certified"],
    filled=True, rounded=True, proportion=True, fontsize=7,
    max_depth=3 
)
plt.title(f"RandomForest — Tree #{tree_idx} (truncated view, max_depth=3)")
plt.tight_layout()
plt.show()

print(f"=== RandomForest — Rules for Tree #{tree_idx} (max_depth=3) ===")
print(export_text(tree0, feature_names=list(rf_feats), max_depth=3))

# 3) RandomForest: feature importances (top 20)
imp = pd.Series(rf_model.feature_importances_, index=rf_feats).sort_values(ascending=False).head(20)
plt.figure(figsize=(8, 6))
plt.barh(imp.index, imp.values)
plt.gca().invert_yaxis()
plt.title("RandomForest — Top feature importances")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
No description has been provided for this image
=== Logistic — Top features with odds ratios ===
feature coef odds_ratio
5 _edu_High School -1.6565 0.1908
18 _uow_Hour -0.7447 0.4749
22 _job_exp_N -0.5450 0.5798
17 _region_West -0.3890 0.6777
12 _continent_South America -0.3827 0.6820
13 _region_Island -0.3278 0.7205
10 _continent_North America -0.2656 0.7668
3 _edu_Bachelor's -0.2298 0.7947
26 _full_time_N -0.2057 0.8141
15 _region_Northeast -0.1928 0.8247
8 _continent_Asia -0.1501 0.8606
24 _job_train_N -0.1043 0.9010
11 _continent_Oceania -0.0742 0.9284
25 _job_train_Y -0.0479 0.9533
19 _uow_Month -0.0409 0.9600
4 _edu_Doctorate 1.0944 2.9873
9 _continent_Europe 0.6457 1.9072
6 _edu_Master's 0.6398 1.8961
14 _region_Midwest 0.5303 1.6994
21 _uow_Year 0.5295 1.6981
23 _job_exp_Y 0.3929 1.4813
16 _region_South 0.2271 1.2550
20 _uow_Week 0.1039 1.1095
7 _continent_Africa 0.0748 1.0777
27 _full_time_Y 0.0535 1.0550
1 log_employees 0.0357 1.0364
0 log_wage_usd_yr 0.0011 1.0011
2 log_company_age -0.0110 0.9891
19 _uow_Month -0.0409 0.9600
25 _job_train_Y -0.0479 0.9533
No description has been provided for this image
=== RandomForest — Rules for Tree #0 (max_depth=3) ===
|--- _edu_Doctorate <= 0.50
|   |--- wage_usd_yr_cap <= 250637.73
|   |   |--- _region_West <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 39
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 33
|   |   |--- _region_West >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 34
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 11
|   |--- wage_usd_yr_cap >  250637.73
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- wage_usd_yr_cap <= 629553.50
|   |   |   |   |--- truncated branch of depth 23
|   |   |   |--- wage_usd_yr_cap >  629553.50
|   |   |   |   |--- truncated branch of depth 22
|   |   |--- _job_exp_Y >  0.50
|   |   |   |--- log_wage_usd_yr <= 14.39
|   |   |   |   |--- truncated branch of depth 17
|   |   |   |--- log_wage_usd_yr >  14.39
|   |   |   |   |--- truncated branch of depth 14
|--- _edu_Doctorate >  0.50
|   |--- _region_South <= 0.50
|   |   |--- log_wage_usd_yr <= 12.32
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 18
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 16
|   |   |--- log_wage_usd_yr >  12.32
|   |   |   |--- log_employees <= 7.40
|   |   |   |   |--- truncated branch of depth 4
|   |   |   |--- log_employees >  7.40
|   |   |   |   |--- truncated branch of depth 10
|   |--- _region_South >  0.50
|   |   |--- employees_clip0_cap <= 9181.50
|   |   |   |--- log_employees <= 8.33
|   |   |   |   |--- truncated branch of depth 14
|   |   |   |--- log_employees >  8.33
|   |   |   |   |--- class: 1.0
|   |   |--- employees_clip0_cap >  9181.50
|   |   |   |--- log_employees <= 11.01
|   |   |   |   |--- truncated branch of depth 5
|   |   |   |--- log_employees >  11.01
|   |   |   |   |--- class: 1.0

No description has been provided for this image
In [47]:
# Freeze final choice per criterion
if "BEST_MODEL_NAME" not in globals():
    BEST_MODEL_NAME = "Logit (logs+OHE+cw)"
if "BEST_PIPE" not in globals():
    BEST_PIPE = pipe_logit
if "BEST_THRESHOLD" not in globals():
    BEST_THRESHOLD = thrs[BEST_MODEL_NAME]
if "BEST_OOF_PROB" not in globals():
    BEST_OOF_PROB = oofs[BEST_MODEL_NAME]

y_true = DF["y"].astype(int).to_numpy()

# Robust threshold picker
def pick_threshold(y_true, y_prob, *, method="youden", target=None):
    y_true = np.asarray(y_true).astype(int)
    y_prob = np.asarray(y_prob).astype(float)

    if method == "youden":
        fpr, tpr, thr = roc_curve(y_true, y_prob)
        j = tpr - fpr
        t = float(thr[int(np.argmax(j))])

    elif method == "f1":
        grid = np.quantile(y_prob, np.linspace(0.01, 0.99, 199))
        f1s = [f1_score(y_true, (y_prob >= g).astype(int), zero_division=0) for g in grid]
        t = float(grid[int(np.argmax(f1s))])

    elif method in {"precision_at", "recall_at"}:
        assert target is not None and 0 < target <= 1, "target must be in (0,1]"
        prec, rec, thr = precision_recall_curve(y_true, y_prob)
        # NOTE: `thr` has length len(prec)-1 == len(rec)-1 and aligns with prec[1:], rec[1:]
        prec_v, rec_v, thr_v = prec[1:], rec[1:], thr
        if method == "precision_at":
            ok = np.where(prec_v >= target)[0]
        else:  # recall_at
            ok = np.where(rec_v >= target)[0]
        if ok.size:
            # choose the HIGHEST threshold that still meets the target
            t = float(thr_v[ok].max())
        else:
            # fallback: use F1-optimal if target is unattainable
            grid = np.quantile(y_prob, np.linspace(0.01, 0.99, 199))
            f1s = [f1_score(y_true, (y_prob >= g).astype(int), zero_division=0) for g in grid]
            t = float(grid[int(np.argmax(f1s))])
    else:
        raise ValueError("Unsupported method.")

    return t

# 2) Compute operating points
t_you = float(BEST_THRESHOLD)                           # default (Youden from OOF)
t80   = pick_threshold(y_true, BEST_OOF_PROB, method="precision_at", target=0.80)
t90   = pick_threshold(y_true, BEST_OOF_PROB, method="recall_at",    target=0.90)

# 3) Metrics table at each threshold
def eval_at(th):
    perf, _ = model_performance_classification_sklearn(y_true, BEST_OOF_PROB, threshold=th)
    return pd.Series(perf).round(4)

tbl = pd.DataFrame(
    {"label": ["Youden (default)", "Precision≥0.80", "Recall≥0.90"],
     "threshold": [t_you, t80, t90]}
)
report = pd.concat([tbl, tbl["threshold"].apply(eval_at)], axis=1)

print(f"=== {BEST_MODEL_NAME} — operating points ===")
display(report)

# 4) Confusion matrices 
print("\nConfusion @ Youden:");    confusion_matrix_sklearn(y_true, BEST_OOF_PROB, threshold=t_you, labels=("Denied","Certified"))
print("\nConfusion @ Prec≥0.80:"); confusion_matrix_sklearn(y_true, BEST_OOF_PROB, threshold=t80, labels=("Denied","Certified"))
print("\nConfusion @ Rec≥0.90:");  confusion_matrix_sklearn(y_true, BEST_OOF_PROB, threshold=t90, labels=("Denied","Certified"))
=== Logit (logs+OHE+cw) — operating points ===
label threshold roc_auc pr_auc ks brier log_loss accuracy precision recall f1 threshold
0 Youden (default) 0.523426 0.7701 0.8649 0.3978 0.1953 0.5785 0.6901 0.8311 0.6728 0.7436 0.5234
1 Precision≥0.80 0.958036 0.7701 0.8649 0.3978 0.1953 0.5785 0.3321 1.0000 0.0001 0.0001 0.9580
2 Recall≥0.90 0.300785 0.7701 0.8649 0.3978 0.1953 0.5785 0.7343 0.7513 0.9001 0.8190 0.3008
Confusion @ Youden:
No description has been provided for this image
Confusion @ Prec≥0.80:
No description has been provided for this image
Confusion @ Rec≥0.90:
No description has been provided for this image
Out[47]:
array([[ 3392,  5070],
       [ 1700, 15318]])
In [48]:
# Use the fitted RandomForest pipeline (rf_pipe) and model (rf_model)
rf_model = rf_pipe.named_steps["model"]
rf_feats = rf_pipe.named_steps["prep"].get_feature_names_out()

# Thresholds for each operating point
thr_you = t_you
thr_prec = t80
thr_rec = t90

for label, thr in [("Youden (default)", thr_you), ("Precision≥0.80", thr_prec), ("Recall≥0.90", thr_rec)]:
    print(f"\n=== RandomForest Visualization @ {label} (threshold={thr:.4f}) ===")

    # 1) Visualize one tree (depth=3 view)
    tree_idx = 0
    tree0 = rf_model.estimators_[tree_idx]

    plt.figure(figsize=(18, 10))
    plot_tree(
        tree0,
        feature_names=rf_feats,
        class_names=["Denied", "Certified"],
        filled=True, rounded=True, proportion=True, fontsize=7,
        max_depth=3
    )
    plt.title(f"RandomForest — Tree #{tree_idx} (max_depth=3) @ {label}")
    plt.tight_layout()
    plt.show()

    # 2) Print rules for the tree (max_depth=3)
    print(f"=== RandomForest — Rules for Tree #{tree_idx} (max_depth=3) @ {label} ===")
    print(export_text(tree0, feature_names=list(rf_feats), max_depth=3))

    # 3) Feature importances (top 20)
    imp = pd.Series(rf_model.feature_importances_, index=rf_feats).sort_values(ascending=False).head(20)
    plt.figure(figsize=(8, 6))
    plt.barh(imp.index, imp.values)
    plt.gca().invert_yaxis()
    plt.title(f"RandomForest — Top feature importances @ {label}")
    plt.xlabel("Importance")
    plt.tight_layout()
    plt.show()
=== RandomForest Visualization @ Youden (default) (threshold=0.5234) ===
No description has been provided for this image
=== RandomForest — Rules for Tree #0 (max_depth=3) @ Youden (default) ===
|--- _edu_Doctorate <= 0.50
|   |--- wage_usd_yr_cap <= 250637.73
|   |   |--- _region_West <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 39
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 33
|   |   |--- _region_West >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 34
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 11
|   |--- wage_usd_yr_cap >  250637.73
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- wage_usd_yr_cap <= 629553.50
|   |   |   |   |--- truncated branch of depth 23
|   |   |   |--- wage_usd_yr_cap >  629553.50
|   |   |   |   |--- truncated branch of depth 22
|   |   |--- _job_exp_Y >  0.50
|   |   |   |--- log_wage_usd_yr <= 14.39
|   |   |   |   |--- truncated branch of depth 17
|   |   |   |--- log_wage_usd_yr >  14.39
|   |   |   |   |--- truncated branch of depth 14
|--- _edu_Doctorate >  0.50
|   |--- _region_South <= 0.50
|   |   |--- log_wage_usd_yr <= 12.32
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 18
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 16
|   |   |--- log_wage_usd_yr >  12.32
|   |   |   |--- log_employees <= 7.40
|   |   |   |   |--- truncated branch of depth 4
|   |   |   |--- log_employees >  7.40
|   |   |   |   |--- truncated branch of depth 10
|   |--- _region_South >  0.50
|   |   |--- employees_clip0_cap <= 9181.50
|   |   |   |--- log_employees <= 8.33
|   |   |   |   |--- truncated branch of depth 14
|   |   |   |--- log_employees >  8.33
|   |   |   |   |--- class: 1.0
|   |   |--- employees_clip0_cap >  9181.50
|   |   |   |--- log_employees <= 11.01
|   |   |   |   |--- truncated branch of depth 5
|   |   |   |--- log_employees >  11.01
|   |   |   |   |--- class: 1.0

No description has been provided for this image
=== RandomForest Visualization @ Precision≥0.80 (threshold=0.9580) ===
No description has been provided for this image
=== RandomForest — Rules for Tree #0 (max_depth=3) @ Precision≥0.80 ===
|--- _edu_Doctorate <= 0.50
|   |--- wage_usd_yr_cap <= 250637.73
|   |   |--- _region_West <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 39
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 33
|   |   |--- _region_West >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 34
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 11
|   |--- wage_usd_yr_cap >  250637.73
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- wage_usd_yr_cap <= 629553.50
|   |   |   |   |--- truncated branch of depth 23
|   |   |   |--- wage_usd_yr_cap >  629553.50
|   |   |   |   |--- truncated branch of depth 22
|   |   |--- _job_exp_Y >  0.50
|   |   |   |--- log_wage_usd_yr <= 14.39
|   |   |   |   |--- truncated branch of depth 17
|   |   |   |--- log_wage_usd_yr >  14.39
|   |   |   |   |--- truncated branch of depth 14
|--- _edu_Doctorate >  0.50
|   |--- _region_South <= 0.50
|   |   |--- log_wage_usd_yr <= 12.32
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 18
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 16
|   |   |--- log_wage_usd_yr >  12.32
|   |   |   |--- log_employees <= 7.40
|   |   |   |   |--- truncated branch of depth 4
|   |   |   |--- log_employees >  7.40
|   |   |   |   |--- truncated branch of depth 10
|   |--- _region_South >  0.50
|   |   |--- employees_clip0_cap <= 9181.50
|   |   |   |--- log_employees <= 8.33
|   |   |   |   |--- truncated branch of depth 14
|   |   |   |--- log_employees >  8.33
|   |   |   |   |--- class: 1.0
|   |   |--- employees_clip0_cap >  9181.50
|   |   |   |--- log_employees <= 11.01
|   |   |   |   |--- truncated branch of depth 5
|   |   |   |--- log_employees >  11.01
|   |   |   |   |--- class: 1.0

No description has been provided for this image
=== RandomForest Visualization @ Recall≥0.90 (threshold=0.3008) ===
No description has been provided for this image
=== RandomForest — Rules for Tree #0 (max_depth=3) @ Recall≥0.90 ===
|--- _edu_Doctorate <= 0.50
|   |--- wage_usd_yr_cap <= 250637.73
|   |   |--- _region_West <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 39
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 33
|   |   |--- _region_West >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 34
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 11
|   |--- wage_usd_yr_cap >  250637.73
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- wage_usd_yr_cap <= 629553.50
|   |   |   |   |--- truncated branch of depth 23
|   |   |   |--- wage_usd_yr_cap >  629553.50
|   |   |   |   |--- truncated branch of depth 22
|   |   |--- _job_exp_Y >  0.50
|   |   |   |--- log_wage_usd_yr <= 14.39
|   |   |   |   |--- truncated branch of depth 17
|   |   |   |--- log_wage_usd_yr >  14.39
|   |   |   |   |--- truncated branch of depth 14
|--- _edu_Doctorate >  0.50
|   |--- _region_South <= 0.50
|   |   |--- log_wage_usd_yr <= 12.32
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 18
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 16
|   |   |--- log_wage_usd_yr >  12.32
|   |   |   |--- log_employees <= 7.40
|   |   |   |   |--- truncated branch of depth 4
|   |   |   |--- log_employees >  7.40
|   |   |   |   |--- truncated branch of depth 10
|   |--- _region_South >  0.50
|   |   |--- employees_clip0_cap <= 9181.50
|   |   |   |--- log_employees <= 8.33
|   |   |   |   |--- truncated branch of depth 14
|   |   |   |--- log_employees >  8.33
|   |   |   |   |--- class: 1.0
|   |   |--- employees_clip0_cap >  9181.50
|   |   |   |--- log_employees <= 11.01
|   |   |   |   |--- truncated branch of depth 5
|   |   |   |--- log_employees >  11.01
|   |   |   |   |--- class: 1.0

No description has been provided for this image
In [49]:
# A) Train/Hold-out, calibration, policy threshold, hold-out evaluation
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_curve, precision_recall_curve

assert "RNG" in globals() and "DF" in globals(), "Need RNG + DF in memory"
assert "BEST_PIPE" in globals() and "BEST_MODEL_NAME" in globals(), "Run selection first"

# 1) Stratified hold-out split (single final test set)
idx = np.arange(len(DF))
idx_tr, idx_ho = train_test_split(idx, test_size=0.20, stratify=DF["y"], random_state=RNG)
DF_TR, DF_HO = DF.iloc[idx_tr].copy(), DF.iloc[idx_ho].copy()
y_tr,  y_ho  = DF_TR["y"].astype(int).to_numpy(), DF_HO["y"].astype(int).to_numpy()
print(f"[Split] Train={len(DF_TR):,}  Hold-out={len(DF_HO):,}")

# 2) Fit calibrated model on TRAIN (sigmoid/Platt; 5-fold inside train only)
cal = CalibratedClassifierCV(BEST_PIPE, method="sigmoid", cv=5)
cal.fit(DF_TR, y_tr)

# 3) Production threshold policy (choose one)
POLICY = "precision_floor"    # options: "precision_floor", "youden", "recall_floor"
PRECISION_FLOOR = 0.80        # only used if POLICY == "precision_floor"
RECALL_FLOOR    = 0.90        # only used if POLICY == "recall_floor"

# OOF-like selection inside TRAIN: use CV predict_proba on TRAIN via CalibratedClassifierCV
proba_tr = cal.predict_proba(DF_TR)[:, 1]

def pick_threshold(y_true, y_prob, *, method="youden", target=None):
    y_true = np.asarray(y_true).astype(int)
    y_prob = np.asarray(y_prob).astype(float)
    if method == "youden":
        fpr, tpr, thr = roc_curve(y_true, y_prob)
        return float(thr[int(np.argmax(tpr - fpr))])
    elif method == "precision_at":
        prec, rec, thr = precision_recall_curve(y_true, y_prob)
        prec_v, thr_v = prec[1:], thr
        ok = np.where(prec_v >= float(target))[0]
        if ok.size:
            # best F1 among feasible thresholds (robust)
            cands = thr_v[ok]
            from sklearn.metrics import f1_score
            f1s = [f1_score(y_true, (y_prob >= t0).astype(int), zero_division=0) for t0 in cands]
            return float(cands[int(np.argmax(f1s))])
        # fallback: Youden if unattainable
        fpr, tpr, thr = roc_curve(y_true, y_prob)
        return float(thr[int(np.argmax(tpr - fpr))])
    elif method == "recall_at":
        prec, rec, thr = precision_recall_curve(y_true, y_prob)
        rec_v, thr_v = rec[1:], thr
        ok = np.where(rec_v >= float(target))[0]
        if ok.size:
            # highest threshold that still satisfies recall floor
            return float(thr_v[ok].max())
        # fallback: minimal threshold
        return float(thr_v.min() if thr_v.size else 0.0)
    else:
        raise ValueError("Unsupported method")

if POLICY == "precision_floor":
    THR_PROD = pick_threshold(y_tr, proba_tr, method="precision_at", target=PRECISION_FLOOR)
elif POLICY == "recall_floor":
    THR_PROD = pick_threshold(y_tr, proba_tr, method="recall_at", target=RECALL_FLOOR)
else:
    THR_PROD = pick_threshold(y_tr, proba_tr, method="youden")

print(f"[Policy] {POLICY} → production threshold = {THR_PROD:.3f}")

# 4) Hold-out evaluation calibrated
proba_ho = cal.predict_proba(DF_HO)[:, 1]
perf_ho, _ = model_performance_classification_sklearn(y_ho, proba_ho, threshold=THR_PROD)
print("\n=== HOLD-OUT METRICS (calibrated) ===")
display(pd.Series(perf_ho).round(4))

print("\nConfusion @ production threshold:")
confusion_matrix_sklearn(y_ho, proba_ho, threshold=THR_PROD, labels=("Denied","Certified"))
[Split] Train=20,384  Hold-out=5,096
[Policy] precision_floor → production threshold = 0.641

=== HOLD-OUT METRICS (calibrated) ===
roc_auc      0.7630
pr_auc       0.8600
ks           0.3953
brier        0.1807
log_loss     0.5389
accuracy     0.7121
precision    0.7952
recall       0.7665
f1           0.7806
threshold    0.6406
dtype: float64
Confusion @ production threshold:
No description has been provided for this image
Out[49]:
array([[1020,  672],
       [ 795, 2609]])
In [50]:
from sklearn.tree import plot_tree, export_text

# Use the fitted RandomForest pipeline (rf_pipe) and model (rf_model)
rf_model = rf_pipe.named_steps["model"]
rf_feats = rf_pipe.named_steps["prep"].get_feature_names_out()

# Visualize one tree (e.g., tree index 0, depth=3)
tree_idx = 0
tree0 = rf_model.estimators_[tree_idx]

plt.figure(figsize=(18, 10))
plot_tree(
    tree0,
    feature_names=rf_feats,
    class_names=["Denied", "Certified"],
    filled=True, rounded=True, proportion=True, fontsize=7,
    max_depth=3
)
plt.title(f"RandomForest — Tree #{tree_idx} (max_depth=3) @ threshold=0.641 (precision_floor)")
plt.tight_layout()
plt.show()

importances = pd.Series(rf_model.feature_importances_, index=rf_feats)
top_importances = importances.sort_values(ascending=False).head(20)

plt.figure(figsize=(8, 6))
plt.barh(top_importances.index, top_importances.values)
plt.gca().invert_yaxis()
plt.title("RandomForest — Top 20 Feature Importances")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()

importances = pd.Series(rf_model.feature_importances_, index=rf_feats)
top_importances = importances.sort_values(ascending=False).head(20)

# Display tree rules (text format, depth=3)
print(f"=== RandomForest — Rules for Tree #{tree_idx} (max_depth=3) @ threshold=0.641 (precision_floor) ===")
print(export_text(tree0, feature_names=list(rf_feats), max_depth=3))
No description has been provided for this image
No description has been provided for this image
=== RandomForest — Rules for Tree #0 (max_depth=3) @ threshold=0.641 (precision_floor) ===
|--- _edu_Doctorate <= 0.50
|   |--- wage_usd_yr_cap <= 250637.73
|   |   |--- _region_West <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 39
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 33
|   |   |--- _region_West >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 34
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 11
|   |--- wage_usd_yr_cap >  250637.73
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- wage_usd_yr_cap <= 629553.50
|   |   |   |   |--- truncated branch of depth 23
|   |   |   |--- wage_usd_yr_cap >  629553.50
|   |   |   |   |--- truncated branch of depth 22
|   |   |--- _job_exp_Y >  0.50
|   |   |   |--- log_wage_usd_yr <= 14.39
|   |   |   |   |--- truncated branch of depth 17
|   |   |   |--- log_wage_usd_yr >  14.39
|   |   |   |   |--- truncated branch of depth 14
|--- _edu_Doctorate >  0.50
|   |--- _region_South <= 0.50
|   |   |--- log_wage_usd_yr <= 12.32
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 18
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 16
|   |   |--- log_wage_usd_yr >  12.32
|   |   |   |--- log_employees <= 7.40
|   |   |   |   |--- truncated branch of depth 4
|   |   |   |--- log_employees >  7.40
|   |   |   |   |--- truncated branch of depth 10
|   |--- _region_South >  0.50
|   |   |--- employees_clip0_cap <= 9181.50
|   |   |   |--- log_employees <= 8.33
|   |   |   |   |--- truncated branch of depth 14
|   |   |   |--- log_employees >  8.33
|   |   |   |   |--- class: 1.0
|   |   |--- employees_clip0_cap >  9181.50
|   |   |   |--- log_employees <= 11.01
|   |   |   |   |--- truncated branch of depth 5
|   |   |   |--- log_employees >  11.01
|   |   |   |   |--- class: 1.0

In [51]:
# B) Curves: ROC, PR, reliability (calibration) on hold-out
import numpy as np, matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score

# ROC
fpr, tpr, _ = roc_curve(y_ho, proba_ho)
rocAUC = auc(fpr, tpr)
plt.figure(figsize=(5.2, 4))
plt.plot(fpr, tpr, label=f"ROC-AUC = {rocAUC:.3f}")
plt.plot([0,1],[0,1],"--",lw=1,alpha=.5)
plt.xlabel("FPR"); plt.ylabel("TPR"); plt.title("Hold-out ROC"); plt.legend(); plt.grid(alpha=.25); plt.tight_layout(); plt.show()

# PR
prec, rec, _ = precision_recall_curve(y_ho, proba_ho)
prAUC = average_precision_score(y_ho, proba_ho)
plt.figure(figsize=(5.2, 4))
plt.plot(rec, prec, label=f"PR-AUC (AP) = {prAUC:.3f}")
plt.xlabel("Recall"); plt.ylabel("Precision"); plt.title("Hold-out PR curve")
plt.legend(); plt.grid(alpha=.25); plt.tight_layout(); plt.show()

# Reliability / calibration curve
prob_true, prob_pred = calibration_curve(y_ho, proba_ho, n_bins=10, strategy="quantile")
plt.figure(figsize=(5.2, 4))
plt.plot(prob_pred, prob_true, marker="o", label="Calibrated")
plt.plot([0,1],[0,1],"--",lw=1,alpha=.5)
plt.xlabel("Predicted probability"); plt.ylabel("Observed frequency")
plt.title("Hold-out reliability curve"); plt.legend(); plt.grid(alpha=.25); plt.tight_layout(); plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [52]:
# C) Slice performance, artifact saving, model card
import json, joblib, pathlib

def slice_report(df, proba, y_true, col, thr):
    rows = []
    for val, idx in df.groupby(col).indices.items():
        yy = y_true[idx]; pp = proba[idx]
        perf, _ = model_performance_classification_sklearn(yy, pp, threshold=thr)
        perf.update(dict(slice=str(col), value=str(val), n=int(len(idx))))
        rows.append(perf)
    return (pd.DataFrame(rows)
            .sort_values("n", ascending=False)
            .loc[:, ["slice","value","n","roc_auc","pr_auc","precision","recall","f1","brier","log_loss"]]
            .round(4))

print("=== Slice metrics (hold-out) ===")
for c in ["_continent","_region","_edu"]:
    if c in DF_HO.columns:
        print(f"\n-- {c} --")
        display(slice_report(DF_HO.reset_index(drop=True), proba_ho, y_ho, c, THR_PROD))

# Save artifacts toggle 
SAVE = True
if SAVE:
    pathlib.Path("models").mkdir(exist_ok=True)
    pathlib.Path("reports").mkdir(exist_ok=True)
    joblib.dump(cal, "models/easyvisa_logit_calibrated.pkl")
    json.dump({"threshold": float(THR_PROD), "rng": int(RNG), "policy": POLICY},
              open("models/threshold.json","w"))
    json.dump({k: float(v) if isinstance(v, (int,float,np.floating)) else v for k,v in perf_ho.items()},
              open("reports/metrics_holdout.json","w"))
    print("[Saved] models/easyvisa_logit_calibrated.pkl, models/threshold.json, reports/metrics_holdout.json")

# Minimal model card 
MODEL_CARD = {
    "Model": BEST_MODEL_NAME + " + sigmoid calibration",
    "Purpose": "Predict likelihood of visa certification.",
    "Data": {"rows": int(len(DF)), "features_numeric": ["log_wage_usd_yr","log_employees","log_company_age"],
             "categoricals": ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"]},
    "Selection_metric": "ROC-AUC (primary), PR-AUC (secondary) via 5-fold OOF on train",
    "Threshold_policy": {"policy": POLICY, "precision_floor": PRECISION_FLOOR, "recall_floor": RECALL_FLOOR,
                         "threshold_production": float(THR_PROD)},
    "Holdout_metrics": {k: float(v) if isinstance(v,(int,float,np.floating)) else v for k,v in perf_ho.items()},
    "Calibration": "Platt (sigmoid), 5-fold on TRAIN; evaluated on HOLD-OUT",
    "Limitations": ["Historical bias possible across regions/education; monitor slice metrics monthly.",
                    "Probability drift expected if wage/employee distributions shift; monitor PSI."],
    "Retrain_triggers": ["PSI > 0.2 on any key feature", "Hold-out ROC-AUC drop > 0.03 vs baseline",
                         "Policy threshold no longer meets precision/recall floor on recent data"]
}
print("\n=== MODEL CARD (summary) ===")
display(pd.Series(MODEL_CARD))
=== Slice metrics (hold-out) ===

-- _continent --
slice value n roc_auc pr_auc precision recall f1 brier log_loss
1 _continent Asia 3325 0.7622 0.8539 0.7837 0.7576 0.7704 0.1839 0.5460
2 _continent Europe 739 0.7628 0.9145 0.8727 0.8861 0.8793 0.1344 0.4280
3 _continent North America 703 0.7102 0.7888 0.7321 0.6721 0.7009 0.2118 0.6149
5 _continent South America 182 0.7055 0.8005 0.7684 0.6348 0.6952 0.2119 0.6183
0 _continent Africa 103 0.8327 0.9395 0.8767 0.8312 0.8533 0.1426 0.4431
4 _continent Oceania 44 0.6873 0.8269 0.7667 0.7419 0.7541 0.1829 0.5448
-- _region --
slice value n roc_auc pr_auc precision recall f1 brier log_loss
2 _region Northeast 1413 0.7902 0.8500 0.8002 0.7744 0.7871 0.1757 0.5318
3 _region South 1360 0.7441 0.8611 0.7916 0.7711 0.7812 0.1861 0.5494
4 _region West 1338 0.7353 0.8183 0.7578 0.6945 0.7248 0.1980 0.5802
1 _region Midwest 906 0.7596 0.9088 0.8381 0.8538 0.8459 0.1532 0.4697
0 _region Island 79 0.7129 0.8310 0.7222 0.5306 0.6118 0.1999 0.5809
-- _edu --
slice value n roc_auc pr_auc precision recall f1 brier log_loss
0 _edu Bachelor's 2044 0.6876 0.7490 0.7292 0.6743 0.7007 0.2098 0.6099
3 _edu Master's 1955 0.7412 0.9051 0.8195 0.9154 0.8648 0.1471 0.4578
2 _edu High School 702 0.5462 0.3993 0.5000 0.0315 0.0593 0.2399 0.6821
1 _edu Doctorate 395 0.7252 0.9473 0.9003 0.9772 0.9372 0.0916 0.3183
[Saved] models/easyvisa_logit_calibrated.pkl, models/threshold.json, reports/metrics_holdout.json

=== MODEL CARD (summary) ===
Model                       Logit (logs+OHE+cw) + sigmoid calibration
Purpose                     Predict likelihood of visa certification.
Data                {'rows': 25480, 'features_numeric': ['log_wage...
Selection_metric    ROC-AUC (primary), PR-AUC (secondary) via 5-fo...
Threshold_policy    {'policy': 'precision_floor', 'precision_floor...
Holdout_metrics     {'roc_auc': 0.7630193792312201, 'pr_auc': 0.86...
Calibration         Platt (sigmoid), 5-fold on TRAIN; evaluated on...
Limitations         [Historical bias possible across regions/educa...
Retrain_triggers    [PSI > 0.2 on any key feature, Hold-out ROC-AU...
dtype: object

Defining scorer to be used for cross-validation and hyperparameter tuning¶

Scoring models with ROC-AUC as the primary metric and PR-AUC (Average Precision) as the secondary. Both are threshold-free and work directly on predicted probabilities, which fits the pipeline picking the operating threshold later by policy—precision ≥ 0.80—after calibration

ROC-AUC measures ranking quality across all thresholds and stays stable under ~2:1 class skew it is the cleanest way to compare models during CV

PR-AUC focuses on the Certified classes precision/recall trade-off when positive class utility matters PR-AUC tells how efficiently models can harvest true positives without flooding false positives

Accuracy can look fine while missing the point under imbalance F1 depends on a specific threshold pushing threshold decisions to a separate policy step Youden or precision floor not into hyper-tuning

Refit on ROC-AUC, but also track PR-AUC, Brier, and LogLoss so we can see ranking + probability quality calibration is applied after picking hyper-params to avoid leakage, and the operating threshold is chosen on the calibrated TRAIN probabilities by policy precision ≥ 0.80 then evaluated on the HOLD-OUT

We are now done with pre-processing and evaluation criterion, so let's start building the model.

Model building with Original data¶

In [53]:
# Model Building — ORIGINAL data 
# DT, RF, Bagging, AdaBoost, HistGradientBoosting
# Calibrate on TRAIN → pick threshold (precision ≥0.80 on TRAIN) → Evaluate on HOLD-OUT

import numpy as np, pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import precision_recall_curve, roc_curve, f1_score

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, HistGradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier

# --- Guards
assert "RNG" in globals() and "DF_TR" in globals() and "DF_HO" in globals(), "Run split first."
assert "model_performance_classification_sklearn" in globals() and "confusion_matrix_sklearn" in globals()

y_tr = DF_TR["y"].astype(int).to_numpy()
y_ho = DF_HO["y"].astype(int).to_numpy()

# --- Precision ≥ floor on TRAIN (fallback Youden)
def _pick_thr_precision_floor(y_true, y_prob, p_floor=0.80):
    y_true = np.asarray(y_true).astype(int); y_prob = np.asarray(y_prob).astype(float)
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    ok = np.where(prec[1:] >= p_floor)[0]
    if ok.size:
        cands = thr[ok]
        f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in cands]
        return float(cands[int(np.argmax(f1s))])
    fpr, tpr, thr = roc_curve(y_true, y_prob)
    return float(thr[int(np.argmax(tpr - fpr))])

# --- Tree-style prep: passthrough numeric + OHE categorical (same as RF you used)
CAT = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF_TR.columns]
NUM = [c for c in ["wage_usd_yr_cap","log_wage_usd_yr","employees_clip0_cap","log_employees","company_age","log_company_age"] if c in DF_TR.columns]
pp_trees = ColumnTransformer(
    [("num", "passthrough", NUM),
     ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), CAT)],
    remainder="drop", verbose_feature_names_out=False
)

# --- 5 baseline models (class_weight on ORIGINAL to respect skew)
DT  = Pipeline([("prep", pp_trees),
                ("model", DecisionTreeClassifier(max_depth=6, class_weight="balanced", random_state=RNG))])

RF  = Pipeline([("prep", pp_trees),
                ("model", RandomForestClassifier(n_estimators=600, random_state=RNG,
                                                 n_jobs=-1, class_weight="balanced_subsample"))])

BAG = Pipeline([("prep", pp_trees),
                ("model", BaggingClassifier(
                    estimator=DecisionTreeClassifier(max_depth=6, class_weight="balanced", random_state=RNG),
                    n_estimators=200, n_jobs=-1, random_state=RNG
                ))])

ADA = Pipeline([("prep", pp_trees),
                ("model", AdaBoostClassifier(
                    DecisionTreeClassifier(max_depth=1, class_weight="balanced", random_state=RNG),
                    n_estimators=100, learning_rate=0.5, random_state=RNG
                ))])

HGB = Pipeline([("prep", pp_trees),
                ("model", HistGradientBoostingClassifier(
                    max_depth=None, learning_rate=0.06, max_iter=600,
                    l2_regularization=0.1, class_weight="balanced", random_state=RNG
                ))])

MODEL_ZOO = {
    "DecisionTree": DT,
    "RandomForest": RF,
    "Bagging":      BAG,
    "AdaBoost":     ADA,
    "HistGB":       HGB,
}

# --- Train→Calibrate→Threshold (TRAIN) → Evaluate (HOLD-OUT)
rows, probs, thrs = [], {}, {}
for name, pipe in MODEL_ZOO.items():
    cal = CalibratedClassifierCV(pipe, method="sigmoid", cv=5).fit(DF_TR, y_tr)
    thr = _pick_thr_precision_floor(y_tr, cal.predict_proba(DF_TR)[:,1], p_floor=0.80)
    proba_ho = cal.predict_proba(DF_HO)[:,1]
    perf, _ = model_performance_classification_sklearn(y_ho, proba_ho, threshold=thr)
    perf["model"] = name
    rows.append(perf); probs[name] = proba_ho; thrs[name] = thr

ORIG_5MODELS = (pd.DataFrame(rows)
    .loc[:, ["model","roc_auc","pr_auc","ks","brier","log_loss","accuracy","precision","recall","f1","threshold"]]
    .sort_values(["roc_auc","pr_auc"], ascending=False)
    .reset_index(drop=True).round(4))

print("=== ORIGINAL data — 5 classification models (calibrated, precision≥0.80 policy) ===")
display(ORIG_5MODELS)

# --- Confusion matrices for the top 2 by ROC-AUC
top2 = ORIG_5MODELS.head(2)["model"].tolist()
for m in top2:
    print(f"\nConfusion — {m} @ thr={thrs[m]:.3f}")
    confusion_matrix_sklearn(y_ho, probs[m], threshold=thrs[m], labels=("Denied","Certified"))
=== ORIGINAL data — 5 classification models (calibrated, precision≥0.80 policy) ===
model roc_auc pr_auc ks brier log_loss accuracy precision recall f1 threshold
0 HistGB 0.7790 0.8749 0.4119 0.1739 0.5169 0.7408 0.7904 0.8328 0.8110 0.5690
1 Bagging 0.7705 0.8678 0.4041 0.1780 0.5306 0.7325 0.7910 0.8149 0.8028 0.5068
2 DecisionTree 0.7661 0.8560 0.3974 0.1790 0.5335 0.7400 0.7725 0.8657 0.8165 0.4881
3 RandomForest 0.7622 0.8657 0.3917 0.1810 0.5388 0.7241 0.7707 0.8355 0.8018 0.5505
4 AdaBoost 0.6782 0.7654 0.3015 0.1993 0.5856 0.7060 0.7169 0.9254 0.8079 0.6132
Confusion — HistGB @ thr=0.569
No description has been provided for this image
Confusion — Bagging @ thr=0.507
No description has been provided for this image
  • Selection rule: By ROC-AUC (primary), HistGB wins — 0.7790, ahead of Bagging 0.7705, Logit 0.7630, RF 0.7622

  • At the cutoff: HistGB has the best overall F1/recall balance; DecisionTree pushes recall highest but trades off precision — threshold effects, not ranking

  • Precision floor check: Thresholds were set on TRAIN (≥0.80). On HOLD-OUT: HistGB ≈ 0.790 precision, Bagging ≈ 0.791 — noted, no changes made

  • Calibration: Reliability looks solid; probabilities are usable

  • Bottom line: Keep HistGB as the baseline reference for the original-data setting; carry this same setup into over/under-sampling and tuning

In [54]:
# Visualize model: tree structure, rules, and feature importances for ORIGINAL data models

# DecisionTree visualization (first tree in Bagging/RandomForest)
for name, pipe in MODEL_ZOO.items():
    # Fit the pipeline on training data before accessing feature names
    pipe.fit(DF_TR, y_tr)
    model = pipe.named_steps["model"]
    prep = pipe.named_steps["prep"]
    feats = prep.get_feature_names_out()
    
    print(f"\n=== {name} ===")
    # For ensemble models, visualize the first tree if possible
    tree = None
    if hasattr(model, "estimators_"):
        tree = model.estimators_[0]
    elif isinstance(model, DecisionTreeClassifier):
        tree = model

    # Only plot tree if it's a DecisionTreeClassifier
    if tree is not None and isinstance(tree, DecisionTreeClassifier):
        plt.figure(figsize=(18, 10))
        plot_tree(
            tree,
            feature_names=feats,
            class_names=["Denied", "Certified"],
            filled=True, rounded=True, proportion=True, fontsize=7,
            max_depth=3
        )
        plt.title(f"{name} — Tree #0 (max_depth=3)")
        plt.tight_layout()
        plt.show()

        # Print tree rules (text format, depth=3)
        print(f"=== {name} — Rules for Tree #0 (max_depth=3) ===")
        print(export_text(tree, feature_names=list(feats), max_depth=3))
    else:
        print(f"[SKIP] Tree visualization not supported for {type(model).__name__}")

    # Feature importances (top 20)
    if hasattr(model, "feature_importances_"):
        importances = pd.Series(model.feature_importances_, index=feats)
        top_imp = importances.sort_values(ascending=False).head(20)
        plt.figure(figsize=(8, 6))
        plt.barh(top_imp.index, top_imp.values)
        plt.gca().invert_yaxis()
        plt.title(f"{name} — Top 20 Feature Importances")
        plt.xlabel("Importance")
        plt.tight_layout()
        plt.show()
=== DecisionTree ===
No description has been provided for this image
=== DecisionTree — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- _job_exp_N <= 0.50
|   |   |--- _edu_Bachelor's <= 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _edu_Bachelor's >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |--- _job_exp_N >  0.50
|   |   |--- _continent_Europe <= 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _continent_Europe >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 3
|--- _edu_High School >  0.50
|   |--- _continent_North America <= 0.50
|   |   |--- _continent_South America <= 0.50
|   |   |   |--- _region_Midwest <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _region_Midwest >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _continent_South America >  0.50
|   |   |   |--- company_age <= 6.50
|   |   |   |   |--- truncated branch of depth 2
|   |   |   |--- company_age >  6.50
|   |   |   |   |--- truncated branch of depth 3
|   |--- _continent_North America >  0.50
|   |   |--- _job_exp_N <= 0.50
|   |   |   |--- wage_usd_yr_cap <= 23696.56
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- wage_usd_yr_cap >  23696.56
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _job_exp_N >  0.50
|   |   |   |--- log_company_age <= 3.81
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_company_age >  3.81
|   |   |   |   |--- truncated branch of depth 3

No description has been provided for this image
=== RandomForest ===
No description has been provided for this image
=== RandomForest — Rules for Tree #0 (max_depth=3) ===
|--- _edu_Doctorate <= 0.50
|   |--- wage_usd_yr_cap <= 284351.00
|   |   |--- _region_West <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 33
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 38
|   |   |--- _region_West >  0.50
|   |   |   |--- _edu_High School <= 0.50
|   |   |   |   |--- truncated branch of depth 27
|   |   |   |--- _edu_High School >  0.50
|   |   |   |   |--- truncated branch of depth 18
|   |--- wage_usd_yr_cap >  284351.00
|   |   |--- log_wage_usd_yr <= 14.37
|   |   |   |--- _job_exp_N <= 0.50
|   |   |   |   |--- truncated branch of depth 16
|   |   |   |--- _job_exp_N >  0.50
|   |   |   |   |--- truncated branch of depth 29
|   |   |--- log_wage_usd_yr >  14.37
|   |   |   |--- employees_clip0_cap <= 3387.50
|   |   |   |   |--- truncated branch of depth 14
|   |   |   |--- employees_clip0_cap >  3387.50
|   |   |   |   |--- truncated branch of depth 9
|--- _edu_Doctorate >  0.50
|   |--- _uow_Hour <= 0.50
|   |   |--- _job_train_Y <= 0.50
|   |   |   |--- _region_South <= 0.50
|   |   |   |   |--- truncated branch of depth 16
|   |   |   |--- _region_South >  0.50
|   |   |   |   |--- truncated branch of depth 12
|   |   |--- _job_train_Y >  0.50
|   |   |   |--- wage_usd_yr_cap <= 83249.25
|   |   |   |   |--- truncated branch of depth 11
|   |   |   |--- wage_usd_yr_cap >  83249.25
|   |   |   |   |--- truncated branch of depth 17
|   |--- _uow_Hour >  0.50
|   |   |--- _region_Midwest <= 0.50
|   |   |   |--- log_company_age <= 2.35
|   |   |   |   |--- class: 1.0
|   |   |   |--- log_company_age >  2.35
|   |   |   |   |--- truncated branch of depth 7
|   |   |--- _region_Midwest >  0.50
|   |   |   |--- _continent_Asia <= 0.50
|   |   |   |   |--- class: 1.0
|   |   |   |--- _continent_Asia >  0.50
|   |   |   |   |--- class: 1.0

No description has been provided for this image
=== Bagging ===
No description has been provided for this image
=== Bagging — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- _job_exp_N <= 0.50
|   |   |--- _edu_Bachelor's <= 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _edu_Bachelor's >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |--- _job_exp_N >  0.50
|   |   |--- _continent_Europe <= 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _continent_Europe >  0.50
|   |   |   |--- log_wage_usd_yr <= 12.15
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_wage_usd_yr >  12.15
|   |   |   |   |--- truncated branch of depth 3
|--- _edu_High School >  0.50
|   |--- _continent_North America <= 0.50
|   |   |--- _region_Midwest <= 0.50
|   |   |   |--- _continent_South America <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _continent_South America >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _region_Midwest >  0.50
|   |   |   |--- company_age <= 8.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- company_age >  8.50
|   |   |   |   |--- truncated branch of depth 3
|   |--- _continent_North America >  0.50
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- log_company_age <= 4.03
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_company_age >  4.03
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _job_exp_Y >  0.50
|   |   |   |--- wage_usd_yr_cap <= 27212.02
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- wage_usd_yr_cap >  27212.02
|   |   |   |   |--- truncated branch of depth 3


=== AdaBoost ===
No description has been provided for this image
=== AdaBoost — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- class: 1
|--- _edu_High School >  0.50
|   |--- class: 0

No description has been provided for this image
=== HistGB ===
[SKIP] Tree visualization not supported for HistGradientBoostingClassifier

Model Building with Oversampled data¶

Oversampled data the same way as the original section, but using an oversampling step on TRAIN only (never the hold-out). Using an imblearn pipeline with RandomOverSampler so there’s no leakage, then calibrate, pick the policy threshold (precision ≥ 0.80), and evaluate on HOLD-OUT

In [55]:
# Model Building — OVERSAMPLED data (TRAIN only)
# DT, RF, Bagging, AdaBoost, HistGradientBoosting
# Calibrate on OVERSAMPLED TRAIN → pick threshold (precision ≥0.80 on TRAIN) → Evaluate on HOLD-OUT

from sklearn.preprocessing import OneHotEncoder

from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, HistGradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier

assert "RNG" in globals() and "DF_TR" in globals() and "DF_HO" in globals(), "Run the original split first."
assert "model_performance_classification_sklearn" in globals() and "confusion_matrix_sklearn" in globals()

y_ho = DF_HO["y"].astype(int).to_numpy()

# Helper: precision ≥ floor on TRAIN (fallback Youden)
def _pick_thr_precision_floor(y_true, y_prob, p_floor=0.80):
    y_true = np.asarray(y_true).astype(int); y_prob = np.asarray(y_prob).astype(float)
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    ok = np.where(prec[1:] >= p_floor)[0]
    if ok.size:
        cands = thr[ok]
        f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in cands]
        return float(cands[int(np.argmax(f1s))])
    fpr, tpr, thr = roc_curve(y_true, y_prob)
    return float(thr[int(np.argmax(tpr - fpr))])

# 1) Create OVERSAMPLED TRAIN (balance classes)
def _make_os(df):
    pos = df[df["y"]==1]; neg = df[df["y"]==0]
    minority, majority = (neg, pos) if len(pos)>=len(neg) else (pos, neg)
    mi = minority.sample(n=len(majority), replace=True, random_state=RNG)
    return pd.concat([majority, mi], axis=0).sample(frac=1.0, random_state=RNG).reset_index(drop=True)

DF_TR_OS = globals().get("DF_TR_OS", _make_os(DF_TR))
y_tr_os = DF_TR_OS["y"].astype(int).to_numpy()

print(f"[Oversample] TRAIN original={len(DF_TR):,}  →  oversampled={len(DF_TR_OS):,}")

# 2) Preprocessors (tree-style: passthrough NUM + OHE CAT)
CAT = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF_TR_OS.columns]
NUM = [c for c in ["wage_usd_yr_cap","log_wage_usd_yr","employees_clip0_cap","log_employees","company_age","log_company_age"] if c in DF_TR_OS.columns]
pp_trees = ColumnTransformer(
    [("num", "passthrough", NUM),
     ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), CAT)],
    remainder="drop", verbose_feature_names_out=False
)

# 3) Five models (class_weight=None to isolate oversampling effect)
DT  = Pipeline([("prep", pp_trees),
                ("model", DecisionTreeClassifier(max_depth=6, class_weight=None, random_state=RNG))])

RF  = Pipeline([("prep", pp_trees),
                ("model", RandomForestClassifier(n_estimators=600, random_state=RNG, n_jobs=-1, class_weight=None))])

BAG = Pipeline([("prep", pp_trees),
                ("model", BaggingClassifier(
                    estimator=DecisionTreeClassifier(max_depth=6, class_weight=None, random_state=RNG),
                    n_estimators=200, n_jobs=-1, random_state=RNG
                ))])

ADA = Pipeline([("prep", pp_trees),
                ("model", AdaBoostClassifier(
                    DecisionTreeClassifier(max_depth=1, class_weight=None, random_state=RNG),
                    n_estimators=100, learning_rate=0.5, random_state=RNG
                ))])

HGB = Pipeline([("prep", pp_trees),
                ("model", HistGradientBoostingClassifier(
                    max_depth=None, learning_rate=0.06, max_iter=600,
                    l2_regularization=0.1, class_weight=None, random_state=RNG
                ))])

MODEL_ZOO_OS = {"DecisionTree": DT, "RandomForest": RF, "Bagging": BAG, "AdaBoost": ADA, "HistGB": HGB}

# 4) Train → Calibrate → Threshold on OVERSAMPLED TRAIN → Evaluate on HOLD-OUT
rows, probs, thrs = [], {}, {}
for name, pipe in MODEL_ZOO_OS.items():
    cal = CalibratedClassifierCV(pipe, method="sigmoid", cv=5).fit(DF_TR_OS, y_tr_os)
    thr = _pick_thr_precision_floor(y_tr_os, cal.predict_proba(DF_TR_OS)[:,1], p_floor=0.80)
    proba_ho = cal.predict_proba(DF_HO)[:,1]
    perf, _ = model_performance_classification_sklearn(y_ho, proba_ho, threshold=thr)
    perf["model"] = name
    rows.append(perf); probs[name] = proba_ho; thrs[name] = thr

OS_5MODELS = (pd.DataFrame(rows)
    .loc[:, ["model","roc_auc","pr_auc","ks","brier","log_loss","accuracy","precision","recall","f1","threshold"]]
    .sort_values(["roc_auc","pr_auc"], ascending=False)
    .reset_index(drop=True).round(4))

print("=== OVERSAMPLED TRAIN — 5 classification models (calibrated, precision≥0.80 policy) ===")
display(OS_5MODELS)

# Confusion matrices for the top 2 by ROC-AUC
_top2 = OS_5MODELS.head(2)["model"].tolist()
for m in _top2:
    print(f"\nConfusion — {m} (OS) @ thr={thrs[m]:.3f}")
    confusion_matrix_sklearn(y_ho, probs[m], threshold=thrs[m], labels=("Denied","Certified"))
[Oversample] TRAIN original=20,384  →  oversampled=27,228
=== OVERSAMPLED TRAIN — 5 classification models (calibrated, precision≥0.80 policy) ===
model roc_auc pr_auc ks brier log_loss accuracy precision recall f1 threshold
0 Bagging 0.7694 0.8667 0.4021 0.1969 0.5772 0.6046 0.8810 0.4718 0.6145 0.6521
1 HistGB 0.7681 0.8664 0.4001 0.1890 0.5580 0.7219 0.8050 0.7703 0.7873 0.5000
2 DecisionTree 0.7650 0.8543 0.3938 0.1981 0.5812 0.6050 0.8811 0.4724 0.6150 0.6439
3 AdaBoost 0.7618 0.8595 0.3889 0.1990 0.5870 0.5861 0.8825 0.4389 0.5862 0.6811
4 RandomForest 0.7563 0.8623 0.3778 0.1965 0.6018 0.7111 0.7856 0.7806 0.7831 0.6739
Confusion — Bagging (OS) @ thr=0.652
No description has been provided for this image
Confusion — HistGB (OS) @ thr=0.500
No description has been provided for this image
  • Ranking rule: Bagging 0.7694 ROC-AUC vs HistGB 0.7681 — Bagging still edges it, but neither beats the original runs (HistGB 0.7790, Logit 0.7630, RF 0.7622). PR-AUC is basically flat

  • At the policy cutoff (precision ≥ 0.80 set on TRAIN):

    • Bagging (oversampled): precision 0.8810, recall 0.4718 — recall tanks

    • HistGB (oversampled): precision 0.8050, recall 0.7703 — more balanced, but a hair worse than original RF

Bottom line: Oversampling added variance and no upside here from the original class-weight setup

In [56]:
# Visualize Bagging model: tree structure, rules, and feature importances
# Fit the Bagging pipeline if not already fitted
BAG.fit(DF_TR, y_tr)
bag_model = BAG.named_steps["model"]
bag_prep = BAG.named_steps["prep"]
bag_feats = bag_prep.get_feature_names_out()

# Visualize the first tree in the Bagging ensemble
tree_idx = 0
tree0 = bag_model.estimators_[tree_idx]

plt.figure(figsize=(18, 10))
plot_tree(
    tree0,
    feature_names=bag_feats,
    class_names=["Denied", "Certified"],
    filled=True, rounded=True, proportion=True, fontsize=7,
    max_depth=3
)
plt.title(f"Bagging — Tree #{tree_idx} (max_depth=3)")
plt.tight_layout()
plt.show()

# Print tree rules (text format, depth=3)
print(f"=== Bagging — Rules for Tree #{tree_idx} (max_depth=3) ===")
print(export_text(tree0, feature_names=list(bag_feats), max_depth=3))

# Feature importances (top 20)
if hasattr(tree0, "feature_importances_"):
    importances = pd.Series(tree0.feature_importances_, index=bag_feats)
    top_imp = importances.sort_values(ascending=False).head(20)
    plt.figure(figsize=(8, 6))
    plt.barh(top_imp.index, top_imp.values)
    plt.gca().invert_yaxis()
    plt.title("Bagging — Top 20 Feature Importances (Tree #0)")
    plt.xlabel("Importance")
    plt.tight_layout()
    plt.show()
No description has been provided for this image
=== Bagging — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- _job_exp_Y <= 0.50
|   |   |--- _uow_Year <= 0.50
|   |   |   |--- _edu_Doctorate <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _edu_Doctorate >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _uow_Year >  0.50
|   |   |   |--- _continent_Europe <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _continent_Europe >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |--- _job_exp_Y >  0.50
|   |   |--- _edu_Bachelor's <= 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _edu_Bachelor's >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 3
|--- _edu_High School >  0.50
|   |--- _continent_North America <= 0.50
|   |   |--- _region_Midwest <= 0.50
|   |   |   |--- _continent_Asia <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _continent_Asia >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _region_Midwest >  0.50
|   |   |   |--- log_company_age <= 4.87
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_company_age >  4.87
|   |   |   |   |--- truncated branch of depth 3
|   |--- _continent_North America >  0.50
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- company_age <= 55.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- company_age >  55.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _job_exp_Y >  0.50
|   |   |   |--- log_wage_usd_yr <= 10.21
|   |   |   |   |--- truncated branch of depth 2
|   |   |   |--- log_wage_usd_yr >  10.21
|   |   |   |   |--- truncated branch of depth 3

No description has been provided for this image

Model Building with Undersampled data¶

In [57]:
# Model Building — UNDERSAMPLED data (TRAIN only)
# DT, RF, Bagging, AdaBoost, HistGradientBoosting
# Calibrate on UNDERSAMPLED TRAIN → pick threshold (precision ≥0.80 on TRAIN) → Evaluate on HOLD-OUT

assert "RNG" in globals() and "DF_TR" in globals() and "DF_HO" in globals(), "Run the original split first."
assert "model_performance_classification_sklearn" in globals() and "confusion_matrix_sklearn" in globals()

y_ho = DF_HO["y"].astype(int).to_numpy()

# ---------- Helper: precision ≥ floor on TRAIN (fallback Youden)
def _pick_thr_precision_floor(y_true, y_prob, p_floor=0.80):
    y_true = np.asarray(y_true).astype(int); y_prob = np.asarray(y_prob).astype(float)
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    ok = np.where(prec[1:] >= p_floor)[0]
    if ok.size:
        cands = thr[ok]
        f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in cands]
        return float(cands[int(np.argmax(f1s))])
    fpr, tpr, thr = roc_curve(y_true, y_prob)
    return float(thr[int(np.argmax(tpr - fpr))])

# ---------- 1) Create UNDERSAMPLED TRAIN (balance by downsampling majority)
def _make_us(df):
    pos = df[df["y"]==1]; neg = df[df["y"]==0]
    majority, minority = (pos, neg) if len(pos)>=len(neg) else (neg, pos)
    maj = majority.sample(n=len(minority), replace=False, random_state=RNG)
    return pd.concat([maj, minority], axis=0).sample(frac=1.0, random_state=RNG).reset_index(drop=True)

DF_TR_US = globals().get("DF_TR_US", _make_us(DF_TR))
y_tr_us = DF_TR_US["y"].astype(int).to_numpy()

print(f"[Undersample] TRAIN original={len(DF_TR):,}  →  undersampled={len(DF_TR_US):,}")

# ---------- 2) Preprocessors (tree-style: passthrough NUM + OHE CAT)
CAT = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF_TR_US.columns]
NUM = [c for c in ["wage_usd_yr_cap","log_wage_usd_yr","employees_clip0_cap","log_employees","company_age","log_company_age"] if c in DF_TR_US.columns]
pp_trees = ColumnTransformer(
    [("num", "passthrough", NUM),
     ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), CAT)],
    remainder="drop", verbose_feature_names_out=False
)

# ---------- 3) Five models (class_weight=None to isolate undersampling effect)
DT  = Pipeline([("prep", pp_trees),
                ("model", DecisionTreeClassifier(max_depth=6, class_weight=None, random_state=RNG))])

RF  = Pipeline([("prep", pp_trees),
                ("model", RandomForestClassifier(n_estimators=600, random_state=RNG, n_jobs=-1, class_weight=None))])

BAG = Pipeline([("prep", pp_trees),
                ("model", BaggingClassifier(
                    estimator=DecisionTreeClassifier(max_depth=6, class_weight=None, random_state=RNG),
                    n_estimators=200, n_jobs=-1, random_state=RNG
                ))])

ADA = Pipeline([("prep", pp_trees),
                ("model", AdaBoostClassifier(
                    DecisionTreeClassifier(max_depth=1, class_weight=None, random_state=RNG),
                    n_estimators=100, learning_rate=0.5, random_state=RNG
                ))])

HGB = Pipeline([("prep", pp_trees),
                ("model", HistGradientBoostingClassifier(
                    max_depth=None, learning_rate=0.06, max_iter=600,
                    l2_regularization=0.1, class_weight=None, random_state=RNG
                ))])

MODEL_ZOO_US = {"DecisionTree": DT, "RandomForest": RF, "Bagging": BAG, "AdaBoost": ADA, "HistGB": HGB}

# ---------- 4) Train → Calibrate → Threshold on UNDERSAMPLED TRAIN → Evaluate on HOLD-OUT
rows, probs, thrs = [], {}, {}
for name, pipe in MODEL_ZOO_US.items():
    cal = CalibratedClassifierCV(pipe, method="sigmoid", cv=5).fit(DF_TR_US, y_tr_us)
    thr = _pick_thr_precision_floor(y_tr_us, cal.predict_proba(DF_TR_US)[:,1], p_floor=0.80)
    proba_ho = cal.predict_proba(DF_HO)[:,1]
    perf, _ = model_performance_classification_sklearn(y_ho, proba_ho, threshold=thr)
    perf["model"] = name
    rows.append(perf); probs[name] = proba_ho; thrs[name] = thr

US_5MODELS = (pd.DataFrame(rows)
    .loc[:, ["model","roc_auc","pr_auc","ks","brier","log_loss","accuracy","precision","recall","f1","threshold"]]
    .sort_values(["roc_auc","pr_auc"], ascending=False)
    .reset_index(drop=True).round(4))

print("=== UNDERSAMPLED TRAIN — 5 classification models (calibrated, precision≥0.80 policy) ===")
display(US_5MODELS)

# Confusion matrices for the top 2 by ROC-AUC
_top2 = US_5MODELS.head(2)["model"].tolist()
for m in _top2:
    print(f"\nConfusion — {m} (US) @ thr={thrs[m]:.3f}")
    confusion_matrix_sklearn(y_ho, probs[m], threshold=thrs[m], labels=("Denied","Certified"))
[Undersample] TRAIN original=20,384  →  undersampled=13,540
=== UNDERSAMPLED TRAIN — 5 classification models (calibrated, precision≥0.80 policy) ===
model roc_auc pr_auc ks brier log_loss accuracy precision recall f1 threshold
0 HistGB 0.7765 0.8743 0.4162 0.1940 0.5658 0.6391 0.8631 0.5464 0.6692 0.5850
1 Bagging 0.7705 0.8695 0.4044 0.1972 0.5779 0.6058 0.8826 0.4727 0.6156 0.6342
2 DecisionTree 0.7650 0.8596 0.3877 0.1981 0.5803 0.6044 0.8805 0.4718 0.6144 0.6437
3 AdaBoost 0.7620 0.8598 0.3908 0.1985 0.5861 0.5842 0.8827 0.4354 0.5831 0.6782
4 RandomForest 0.7614 0.8660 0.3895 0.2005 0.5848 0.6752 0.8319 0.6439 0.7259 0.5313
Confusion — HistGB (US) @ thr=0.585
No description has been provided for this image
Confusion — Bagging (US) @ thr=0.634
No description has been provided for this image
  • Ranking rule: HistGB 0.7763 ROC-AUC, Bagging 0.7701 — basically the same as the original (HistGB 0.7790, Logit 0.7630, RF 0.7622); no lift

  • At the cutoff precision ≥ 0.80 set on TRAIN:

    • Bagging undersampled: precision 0.8824, recall 0.4739 still recall tanks

    • RF undersampled: precision 0.8303, recall 0.6513, F1 0.7300 meaning more balanced than Bagging-US, still not better than original RF

Bottom line: Undersampling drops data and does not help here sticking with the original class-weight + calibrated Logit as the reference, and treat OS/US as negative controls

In [58]:
# Visualize UNDERSAMPLED TRAIN 
for name, pipe in MODEL_ZOO_US.items():
    # Fit the pipeline on undersampled training data before accessing feature names
    pipe.fit(DF_TR_US, y_tr_us)
    model = pipe.named_steps["model"]
    prep = pipe.named_steps["prep"]
    feats = prep.get_feature_names_out()
    
    print(f"\n=== {name} (Undersampled) ===")
    # For ensemble models, visualize the first tree if possible
    tree = None
    if hasattr(model, "estimators_"):
        tree = model.estimators_[0]
    elif isinstance(model, DecisionTreeClassifier):
        tree = model

    # Only plot tree if it's a DecisionTreeClassifier
    if tree is not None and isinstance(tree, DecisionTreeClassifier):
        plt.figure(figsize=(18, 10))
        plot_tree(
            tree,
            feature_names=feats,
            class_names=["Denied", "Certified"],
            filled=True, rounded=True, proportion=True, fontsize=7,
            max_depth=3
        )
        plt.title(f"{name} (Undersampled) — Tree #0 (max_depth=3)")
        plt.tight_layout()
        plt.show()

        # Print tree rules (text format, depth=3)
        print(f"=== {name} (Undersampled) — Rules for Tree #0 (max_depth=3) ===")
        print(export_text(tree, feature_names=list(feats), max_depth=3))
    else:
        print(f"[SKIP] Tree visualization not supported for {type(model).__name__}")

    # Feature importances (top 20)
    if hasattr(model, "feature_importances_"):
        importances = pd.Series(model.feature_importances_, index=feats)
        top_imp = importances.sort_values(ascending=False).head(20)
        plt.figure(figsize=(8, 6))
        plt.barh(top_imp.index, top_imp.values)
        plt.gca().invert_yaxis()
        plt.title(f"{name} (Undersampled) — Top 20 Feature Importances")
        plt.xlabel("Importance")
        plt.tight_layout()
        plt.show()
=== DecisionTree (Undersampled) ===
No description has been provided for this image
=== DecisionTree (Undersampled) — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- _job_exp_Y <= 0.50
|   |   |--- _uow_Year <= 0.50
|   |   |   |--- _edu_Doctorate <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _edu_Doctorate >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _uow_Year >  0.50
|   |   |   |--- _continent_Europe <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _continent_Europe >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |--- _job_exp_Y >  0.50
|   |   |--- _edu_Bachelor's <= 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _edu_Bachelor's >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 3
|--- _edu_High School >  0.50
|   |--- _continent_Asia <= 0.50
|   |   |--- _continent_Europe <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _continent_Europe >  0.50
|   |   |   |--- log_employees <= 7.27
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_employees >  7.27
|   |   |   |   |--- truncated branch of depth 3
|   |--- _continent_Asia >  0.50
|   |   |--- _region_Northeast <= 0.50
|   |   |   |--- _region_West <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _region_West >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _region_Northeast >  0.50
|   |   |   |--- wage_usd_yr_cap <= 504597.28
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- wage_usd_yr_cap >  504597.28
|   |   |   |   |--- truncated branch of depth 3

No description has been provided for this image
=== RandomForest (Undersampled) ===
No description has been provided for this image
=== RandomForest (Undersampled) — Rules for Tree #0 (max_depth=3) ===
|--- _edu_Doctorate <= 0.50
|   |--- wage_usd_yr_cap <= 252093.80
|   |   |--- _region_West <= 0.50
|   |   |   |--- _job_exp_Y <= 0.50
|   |   |   |   |--- truncated branch of depth 30
|   |   |   |--- _job_exp_Y >  0.50
|   |   |   |   |--- truncated branch of depth 40
|   |   |--- _region_West >  0.50
|   |   |   |--- _edu_High School <= 0.50
|   |   |   |   |--- truncated branch of depth 28
|   |   |   |--- _edu_High School >  0.50
|   |   |   |   |--- truncated branch of depth 18
|   |--- wage_usd_yr_cap >  252093.80
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 10
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 22
|   |   |--- _job_exp_Y >  0.50
|   |   |   |--- _uow_Hour <= 0.50
|   |   |   |   |--- truncated branch of depth 12
|   |   |   |--- _uow_Hour >  0.50
|   |   |   |   |--- truncated branch of depth 15
|--- _edu_Doctorate >  0.50
|   |--- wage_usd_yr_cap <= 213888.59
|   |   |--- log_wage_usd_yr <= 10.36
|   |   |   |--- log_company_age <= 4.48
|   |   |   |   |--- truncated branch of depth 10
|   |   |   |--- log_company_age >  4.48
|   |   |   |   |--- class: 1.0
|   |   |--- log_wage_usd_yr >  10.36
|   |   |   |--- _region_Island <= 0.50
|   |   |   |   |--- truncated branch of depth 18
|   |   |   |--- _region_Island >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |--- wage_usd_yr_cap >  213888.59
|   |   |--- _region_Northeast <= 0.50
|   |   |   |--- company_age <= 9.50
|   |   |   |   |--- class: 1.0
|   |   |   |--- company_age >  9.50
|   |   |   |   |--- truncated branch of depth 7
|   |   |--- _region_Northeast >  0.50
|   |   |   |--- log_company_age <= 2.92
|   |   |   |   |--- truncated branch of depth 4
|   |   |   |--- log_company_age >  2.92
|   |   |   |   |--- truncated branch of depth 4

No description has been provided for this image
=== Bagging (Undersampled) ===
No description has been provided for this image
=== Bagging (Undersampled) — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- _job_exp_Y <= 0.50
|   |   |--- _uow_Hour <= 0.50
|   |   |   |--- _continent_Europe <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _continent_Europe >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _uow_Hour >  0.50
|   |   |   |--- log_employees <= 7.96
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_employees >  7.96
|   |   |   |   |--- truncated branch of depth 3
|   |--- _job_exp_Y >  0.50
|   |   |--- _edu_Bachelor's <= 0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _edu_Bachelor's >  0.50
|   |   |   |--- _uow_Year <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _uow_Year >  0.50
|   |   |   |   |--- truncated branch of depth 3
|--- _edu_High School >  0.50
|   |--- _continent_North America <= 0.50
|   |   |--- _region_Northeast <= 0.50
|   |   |   |--- _region_West <= 0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- _region_West >  0.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |--- _region_Northeast >  0.50
|   |   |   |--- log_wage_usd_yr <= 8.67
|   |   |   |   |--- truncated branch of depth 2
|   |   |   |--- log_wage_usd_yr >  8.67
|   |   |   |   |--- truncated branch of depth 3
|   |--- _continent_North America >  0.50
|   |   |--- _job_exp_Y <= 0.50
|   |   |   |--- employees_clip0_cap <= 7673.50
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- employees_clip0_cap >  7673.50
|   |   |   |   |--- truncated branch of depth 2
|   |   |--- _job_exp_Y >  0.50
|   |   |   |--- log_wage_usd_yr <= 9.84
|   |   |   |   |--- truncated branch of depth 3
|   |   |   |--- log_wage_usd_yr >  9.84
|   |   |   |   |--- truncated branch of depth 3


=== AdaBoost (Undersampled) ===
No description has been provided for this image
=== AdaBoost (Undersampled) — Rules for Tree #0 (max_depth=3) ===
|--- _edu_High School <= 0.50
|   |--- class: 1
|--- _edu_High School >  0.50
|   |--- class: 0

No description has been provided for this image
=== HistGB (Undersampled) ===
[SKIP] Tree visualization not supported for HistGradientBoostingClassifier

Hyperparameter Tuning¶

Best practices for hyperparameter tuning in AdaBoost:

n_estimators:

  • Start with a specific number (50 is used in general) and increase in steps: 50, 75, 85, 100

  • Use fewer estimators (e.g., 50 to 100) if using complex base learners (like deeper decision trees)

  • Use more estimators (e.g., 100 to 150) when learning rate is low (e.g., 0.1 or lower)

  • Avoid very high values unless performance keeps improving on validation

learning_rate:

  • Common values to try: 1.0, 0.5, 0.1, 0.01

  • Use 1.0 for faster training, suitable for fewer estimators

  • Use 0.1 or 0.01 when using more estimators to improve generalization

  • Avoid very small values (< 0.01) unless you plan to use many estimators (e.g., >500) and have sufficient data


Best practices for hyperparameter tuning in Random Forest:

n_estimators:

  • Start with a specific number (50 is used in general) and increase in steps: 50, 75, 100, 125
  • Higher values generally improve performance but increase training time
  • Use 100-150 for large datasets or when variance is high

min_samples_leaf:

  • Try values like: 1, 2, 4, 5, 10
  • Higher values reduce model complexity and help prevent overfitting
  • Use 1–2 for low-bias models, higher (like 5 or 10) for more regularized models
  • Works well in noisy datasets to smooth predictions

max_features:

  • Try values: "sqrt" (default for classification), "log2", None, or float values (e.g., 0.3, 0.5)
  • "sqrt" balances between diversity and performance for classification tasks
  • Lower values (e.g., 0.3) increase tree diversity, reducing overfitting
  • Higher values (closer to 1.0) may capture more interactions but risk overfitting

max_samples (for bootstrap sampling):

  • Try float values between 0.5 to 1.0 or fixed integers
  • Use 0.6–0.9 to introduce randomness and reduce overfitting
  • Smaller values increase diversity between trees, improving generalization

Best practices for hyperparameter tuning in Gradient Boosting:

n_estimators:

  • Start with 100 (default) and increase: 100, 200, 300, 500
  • Typically, higher values lead to better performance, but they also increase training time
  • Use 200–500 for larger datasets or complex problems
  • Monitor validation performance to avoid overfitting, as too many estimators can degrade generalization

learning_rate:

  • Common values to try: 0.1, 0.05, 0.01, 0.005
  • Use lower values (e.g., 0.01 or 0.005) if you are using many estimators (e.g., > 200)
  • Higher learning rates (e.g., 0.1) can be used with fewer estimators for faster convergence
  • Always balance the learning rate with n_estimators to prevent overfitting or underfitting

subsample:

  • Common values: 0.7, 0.8, 0.9, 1.0
  • Use a value between 0.7 and 0.9 for improved generalization by introducing randomness
  • 1.0 uses the full dataset for each boosting round, potentially leading to overfitting
  • Reducing subsample can help reduce overfitting, especially in smaller datasets

max_features:

  • Common values: "sqrt", "log2", or float (e.g., 0.3, 0.5)
  • "sqrt" (default) works well for classification tasks
  • Lower values (e.g., 0.3) help reduce overfitting by limiting the number of features considered at each split

Best practices for hyperparameter tuning in XGBoost:

n_estimators:

  • Start with 50 and increase in steps: 50,75,100,125.
  • Use more estimators (e.g., 150-250) when using lower learning rates
  • Monitor validation performance
  • High values improve learning but increase training time

subsample:

  • Common values: 0.5, 0.7, 0.8, 1.0
  • Use 0.7–0.9 to introduce randomness and reduce overfitting
  • 1.0 uses the full dataset in each boosting round; may overfit on small datasets
  • Values < 0.5 are rarely useful unless dataset is very large

gamma:

  • Try values: 0 (default), 1, 3, 5, 8
  • Controls minimum loss reduction needed for a split
  • Higher values make the algorithm more conservative (i.e., fewer splits)
  • Use values > 0 to regularize and reduce overfitting, especially on noisy data

colsample_bytree:

  • Try values: 0.3, 0.5, 0.7, 1.0
  • Fraction of features sampled per tree
  • Lower values (e.g., 0.3 or 0.5) increase randomness and improve generalization
  • Use 1.0 when you want all features considered for every tree

colsample_bylevel:

  • Try values: 0.3, 0.5, 0.7, 1.0
  • Fraction of features sampled at each tree level (i.e., per split depth)
  • Lower values help in regularization and reducing overfitting
  • Often used in combination with colsample_bytree for fine control over feature sampling

In [61]:
# Hyperparameter Tuning — Top 3 models (tune once), calibrate, evaluate, visualize
assert "RNG" in globals() and "DF_TR" in globals() and "DF_HO" in globals()
assert "model_performance_classification_sklearn" in globals() and "confusion_matrix_sklearn" in globals()

from sklearn.model_selection import RandomizedSearchCV

y_tr = DF_TR["y"].astype(int).to_numpy()
y_ho = DF_HO["y"].astype(int).to_numpy()

# ---------- Helpers ----------
def pick_thr_precision_floor(y_true, y_prob, p_floor=0.80):
    y_true = np.asarray(y_true).astype(int); y_prob = np.asarray(y_prob).astype(float)
    prec, rec, thr = precision_recall_curve(y_true, y_prob)
    ok = np.where(prec[1:] >= p_floor)[0]
    if ok.size:
        cands = thr[ok]
        f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0) for t in cands]
        return float(cands[int(np.argmax(f1s))])
    fpr, tpr, thr = roc_curve(y_true, y_prob)
    return float(thr[int(np.argmax(tpr - fpr))])

def eval_at(y_true, y_prob, thr):
    return model_performance_classification_sklearn(y_true, y_prob, threshold=thr)[0]

# Create OS/US TRAIN (if not already)
def _make_os(df):
    pos, neg = df[df["y"]==1], df[df["y"]==0]
    minority, majority = (neg, pos) if len(pos)>=len(neg) else (pos, neg)
    mi = minority.sample(n=len(majority), replace=True, random_state=RNG)
    return pd.concat([majority, mi], axis=0).sample(frac=1.0, random_state=RNG).reset_index(drop=True)

def _make_us(df):
    pos, neg = df[df["y"]==1], df[df["y"]==0]
    majority, minority = (pos, neg) if len(pos)>=len(neg) else (neg, pos)
    maj = majority.sample(n=len(minority), replace=False, random_state=RNG)
    return pd.concat([maj, minority], axis=0).sample(frac=1.0, random_state=RNG).reset_index(drop=True)

DF_TR_OS = globals().get("DF_TR_OS", _make_os(DF_TR))
DF_TR_US = globals().get("DF_TR_US", _make_us(DF_TR))

# Tree-style prep (NUM passthrough + OHE CAT)
CAT = [c for c in ["_edu","_continent","_region","_uow","_job_exp","_job_train","_full_time"] if c in DF_TR.columns]
NUM = [c for c in ["wage_usd_yr_cap","log_wage_usd_yr","employees_clip0_cap","log_employees","company_age","log_company_age"] if c in DF_TR.columns]
pp_trees = ColumnTransformer([("num","passthrough",NUM),
                              ("cat",OneHotEncoder(handle_unknown="ignore", sparse_output=True), CAT)],
                             remainder="drop", verbose_feature_names_out=False)

# ---------- Choose your top 3 (adjust if desired) ----------
# ('model_name', 'setting'): setting ∈ {'orig','os','us'}
CANDIDATES = [
    ("HistGB",   "orig"),  # HistGB on Original
    ("HistGB",   "us"),    # HistGB on Undersampled
    ("Bagging",  "os"),    # Bagging on Oversampled
]

def build_pipeline(name, setting):
    # class_weight only on original to respect natural skew
    cw_tree = "balanced" if setting=="orig" else None
    cw_rf   = "balanced_subsample" if setting=="orig" else None

    if name=="HistGB":
        est = HistGradientBoostingClassifier(random_state=RNG, class_weight=cw_tree)
    elif name=="Bagging":
        est = BaggingClassifier(
            estimator=DecisionTreeClassifier(max_depth=6, class_weight=cw_tree, random_state=RNG),
            n_estimators=200, n_jobs=-1, random_state=RNG
        )
    elif name=="RandomForest":
        est = RandomForestClassifier(n_estimators=600, random_state=RNG, n_jobs=-1, class_weight=cw_rf)
    elif name=="DecisionTree":
        est = DecisionTreeClassifier(max_depth=6, class_weight=cw_tree, random_state=RNG)
    elif name=="AdaBoost":
        est = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1, class_weight=cw_tree, random_state=RNG),
                                 random_state=RNG)
    else:
        raise ValueError(name)
    return Pipeline([("prep", pp_trees), ("model", est)])

# Small, sane grids (one pass)
GRIDS = {
    "HistGB": {
        "model__learning_rate":[0.03,0.06,0.1],
        "model__max_depth":[None,6,10],
        "model__l2_regularization":[0.0,0.1,0.5],
        "model__max_iter":[400,600,800],
    },
    "Bagging": {
        "model__n_estimators":[150,250,350],
        # base tree depth tweak via replacement if desired:
        # "model__base_estimator__max_depth":[4,6,8],
    },
    "RandomForest": {
        "model__n_estimators":[400,600,800],
        "model__max_depth":[None,12,18],
        "model__min_samples_leaf":[1,3,10],
    },
    "DecisionTree": {
        "model__max_depth":[3,6,9,12],
        "model__min_samples_leaf":[1,5,20],
    },
    "AdaBoost": {
        "model__n_estimators":[50,75,100,150],
        "model__learning_rate":[1.0,0.5,0.1],
    },
}

def get_train(sett):
    if sett=="orig": return DF_TR, y_tr
    if sett=="os":   return DF_TR_OS, DF_TR_OS["y"].astype(int).to_numpy()
    if sett=="us":   return DF_TR_US, DF_TR_US["y"].astype(int).to_numpy()
    raise ValueError(sett)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RNG)

rows = []
viz_conf = []  # store (label, probs, thr) for confusion plots

for name, setting in CANDIDATES:
    Xtr, ytr = get_train(setting)
    pipe = build_pipeline(name, setting)
    grid = GRIDS[name]
    # Randomized search for larger grids, else grid search
    n_cands = np.prod([len(v) for v in grid.values()])
    search = (RandomizedSearchCV(pipe, grid, n_iter=min(20, n_cands), scoring="roc_auc",
                                 cv=cv, n_jobs=-1, random_state=RNG, refit=True, verbose=0)
              if n_cands > 30 else
              GridSearchCV(pipe, grid, scoring="roc_auc", cv=cv, n_jobs=-1, refit=True, verbose=0))

    # Fit
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=FutureWarning)
        search.fit(Xtr, ytr)

    # Calibrate on ORIGINAL TRAIN for consistent policy & thresholding
    cal = CalibratedClassifierCV(search.best_estimator_, method="sigmoid", cv=5).fit(DF_TR, y_tr)
    thr = pick_thr_precision_floor(y_tr, cal.predict_proba(DF_TR)[:,1], p_floor=0.80)
    proba_ho = cal.predict_proba(DF_HO)[:,1]
    perf = eval_at(y_ho, proba_ho, thr)
    perf.update(model=name, setting=setting, best_params=search.best_params_)
    rows.append(perf)
    viz_conf.append((f"{name} ({setting})", proba_ho, thr))

TUNED3 = (pd.DataFrame(rows)
          .loc[:, ["model","setting","best_params","roc_auc","pr_auc","accuracy","precision","recall","f1","brier","log_loss","threshold"]]
          .sort_values(["roc_auc","pr_auc"], ascending=False)
          .reset_index(drop=True).round(4))

print("=== Tuned top-3 — calibrated, policy thresholded (HOLD-OUT) ===")
display(TUNED3)

# Confusion matrices for each tuned model
for lbl, ph, thr in viz_conf:
    print(f"\nConfusion — {lbl} @ thr={thr:.3f}")
    confusion_matrix_sklearn(y_ho, ph, threshold=thr, labels=("Denied","Certified"))
=== Tuned top-3 — calibrated, policy thresholded (HOLD-OUT) ===
model setting best_params roc_auc pr_auc accuracy precision recall f1 brier log_loss threshold
0 HistGB orig {'model__max_iter': 400, 'model__max_depth': 1... 0.7792 0.8751 0.7396 0.7913 0.8287 0.8096 0.1739 0.5167 0.5730
1 HistGB us {'model__max_iter': 600, 'model__max_depth': 1... 0.7788 0.8749 0.7402 0.7900 0.8323 0.8106 0.1738 0.5168 0.5683
2 Bagging os {'model__n_estimators': 350} 0.7717 0.8697 0.7392 0.7909 0.8287 0.8094 0.1785 0.5331 0.5447
Confusion — HistGB (orig) @ thr=0.573
No description has been provided for this image
Confusion — HistGB (us) @ thr=0.568
No description has been provided for this image
Confusion — Bagging (os) @ thr=0.545
No description has been provided for this image

Model Performance Summary and Final Model Selection¶

Selected HistGradientBoosting Original, tuned as the final model as it achieved:

  • ROC-AUC = 0.7792

  • PR-AUC = 0.8751 on the hold-out

  • Precision = 0.791

  • Recall = 0.829

  • F1 = 0.810

  • Threshold = 0.573 calibrated with threshold set on TRAIN with precision great than or equal to 0.80

The tuned US and OS variants were close but did not exceed this AUC kept the calibrated probabilities and the precision-floor policy for deployment.

Actionable Insights and Recommendations¶

Recommendation: prioritize applicants with higher education such as Master or Doctorate as well as yearly wage unit and strong company profiles with size and age these factors drive certification odds.

Power Ahead